A cell-core-attachment approach for identifying protein complexes in yeast protein-protein interaction network

Abstract

With the increasing of available protein-protein interaction (PPI) data, many computational methods have been explored to identify protein complexes from PPI networks. Majority of algorithms employ the feature of local neighbors to detect local dense subgraphs which correspond to protein complexes. Those approaches neglect the inherent core-attachment structure of protein complexes, which to an extent affect the protein complexes of prediction accuracy. In this paper, we propose a new algorithm for predicting protein complexes, deriving from the framework of the core-attachment. The proposed method first obtains the triangular structures of the core of protein complexes, name as cells, in which the edge-clustering coefficient is used. And then the cells are expanded to protein complex cores based on the closeness. Finally, the attachments are added to their corresponding cores to form the final protein complexes. The experimental results on two yeast PPI data show our method outperform the existing algorithms in terms of matched protein complexes and biological significance using two benchmark data sets.

Keywords

Core-attachment protein complex protein-protein interaction triangular structure

1 Introduction

In the past several years, large-scale high-through experimental data have been produced from different organisms [1], which inspires the researchers’ interesting. For example, Li et al. employed the parallel strategy to align multiple biological sequences [2, 3] and protein structures [4]. Liang et al. [5] revealed the evolving processes of yeast protein-protein interaction (PPI) network by network motifs to the age-dependency. More importantly, identification of protein complexes contributes to our understanding of the organization and interaction of cellular processes, which is useful in system-level understanding of biological organization.With the increasing of biological experimental data, many methods are developed to detect protein complexes [6 –8], which play an important role in performing a special cellular functions. For example, the RNA polymerase II complex transcribes genetic information into messages for ribosomes to produce proteins [9]. The nuclear pore complex protects the exchange of components between the nucleus and cytoplasm, preventing the transport of material that unintentionally crosses the nuclear membrane [10].

Recently, numerous computational methods are used to systematically elucidate and analyze the large-scale PPI network, in which a vertex represents a protein and an edge denotes the interaction between a pair of proteins. These computational methods are based on the assumption that proteins in one complex have more interactions among themselves than others. Generally, majority of Computational methods mainly focus on detecting highly connected subgraphs as protein complexes in the PPI networks [11 –15]. Molecular complex detection (MCODE) [11] starts with high weighted protein, where is calculated by the density of their local neighborhoods. It predicts protein complexes by extending seed protein. CFinder [12] utilizes the cliques to define a dense subgraph, which is discovered by merging a set of adjacent k-cliques if they share (k - 1) common nodes. MCL [13] proposed the concept of random “walks” to predict protein complexes. An efficient and robust variation of MCL, known as R-MCL [14], was proposed in 2010; R-MCL regularized operation and balances parameter to refine the precision of detected protein complexes. ‘Soft’ R-MCL (SR-MCL) [15] is another algorithm that can produce overlapping protein complexes.

Although these methods can effectively predict protein complexes in PPI network, the inherent architecture of complexes are often neglected. Previous studies [16] uncovers a protein complex is generally composed of a core and attachments. A protein core is a small group of proteins with highly connected each other and high functional similarity. The core is the key functional unit of the complex and largely determines its cellular function and significance [16, 17]. Based on the core-attachment structure of protein complex, Leung et al. [18] proposed the CORE algorithm that predicts protein-complex cores based on a statistical framework. This approach used topology information of the PPI network to determine if two proteins are in the same protein-complex core. Wu et al. [19] introduced the COACH algorithm derived from the structure of core-attachment, which defined the core vertices among the neighborhood graphs, and then identified the preliminary cores of the protein complexes. This method added the attachments into the complex cores to form protein complexes. Cao et al. [20] proposed a novel multiObjective evolutionary programming genetic algorithm which integrates multiple network topological features to detect biologically meaningful protein complexes. It can serve as a powerful framework to detect protein complexes, which experimental results outperform those methods with single network topological characteristic.

In this study, motivated by interactive features among the nodes in communication network, we propose a novel algorithm based on the core-attachment method. In the core-attachment framework, a smaller stable core of core, named as cell, may reasonably be included into the cores of protein complexes. Considering the invariance of the statistical properties of triangular structure in the scale-free networks, including PPI network [21], our algorithm mainly consists of three steps: first, our method obtains the stable cells as the heart of the cores by using the edge-clustering coefficient (ECC). Second, these cells are extended to the complex cores by filtering the candidate set. In this process, we propose a novel measure rule to quantify the relationship between the candidate nodes and the complex cores. Finally, we add attachments into the complex cores to generate final protein complexes. To evaluate the efficiency and quality of our method, we apply it to two yeast PPI network, and verify our results using two public datasets of reference protein complexes. We compare our approaches to several state-of-the-art algorithms such as MCL [13], ClusterONE [1], MCODE [11], SPICi [22], COACH [19] and Core [18]. The experimental results show that our method achieves higher accuracy than the existing computational methods. Moreover, the cells and cores of the predicted protein complexes show high functional homogeneity and biological similarity. Figure 1 presents the framework of our study.

2 Methods

A PPI network can be modeled as an unweighted undirected graph G = (V, E), in which the vertex set V represents the proteins and the edge set E represents the set of interactions between pairs of proteins.

2.1 Edge-clustering coefficient (ECC)

ECC characterizes the “closeness” between the two connected nodes and other nodes around the edge. Radicchi et al. [23] first proposed the clustering coefficient of the edge in a complex network. According to the ECC, we can measure the topological property of the PPI network. Higher ECC indicates the edge is more important in the network. The ECC is defined as Equation (1) [23]: $ECC (u, v) = \frac{Z_{uv}}{min (d_{u} - 1, d_{v} - 1)}$ (1)

In Equation (1), Z_uv is the number of triangles built on the edge, d_u and d_v represents the degrees of nodes u and v in the network, respectively. From Equation (1), we can infer that the value of ECC is between 0 and 1.

The importance of nodes can be quantified by the sum of ECC (SoECC) of the adjacent edges,which is presented as Equation (2) [24], where N_u is the neighbor protein set of u. $SoECC (u) = \sum_{v \in N_{u}} ECC (u, v)$ (2)

SoECC quantifies the importance of nodes in the PPI network, which has been widely used in predicting essential proteins. Wang et al. [24] used the ECC to identify the essential proteins in the PPI network. Here, it is used to predict protein complexes in PPI network. Obviously, low SoECC indicates less important nodes. According to Equations (1) and (2), SoECC (u) = 0 means there is no link from other nodes to u. Similarity, ECC (u, v) = 0 means there is no common node connects with node u and node v. From the analysis, we can infer that the node u is impossible to be a part of a cell if SoECC (u) is 0. Therefore, we filter these nodes with SoECC = 0, and then save the remaining nodes in SN using Algorithm 1, which is described in Table 1.

2.2 Generating the cells

A triangular structure is composed of three nodes and three edges. In the scale-free network, the inherent statistical properties of triangular structure are relatively solid [21]. Interestingly, the stable structure is also widely used in microcosmic concepts, including chemical bonds and molecular structure. In this study, we regard the triangular structure as the center of core.

It is note that, the triangular structures, which are composed of three nodes with high SoECC and the corresponding edges, are achieved by ranking the SoECC value. As cell, the special structures are of key meanings for protein complex cores because of its stability and expansibility. Algorithm 2 gives the detailed description (in Table 2). In algorithm 2, the SN set is achieved from algorithm 1, and then we construct a cell containing three nodes with the highest SoECC values and their corresponding edges, the CELL is added into the set of cells, named SC. To achieve non-overlapping cells in SC, we remove the three nodes in cell from SN set, until the size of SN is less than 3, finally the SC set is outputted.

2.3 Constructing protein complex cores

The cells are the primary part of the complex cores. Therefore, we need to expand the cells to form the complex cores. We place those direct neighbors of the proteins in the cells into set N to form the set of candidate cores, and descending sort the set by SoECC values. We then compute the closeness between each candidate and the core. Finally, the candidates with closeness higher than the threshold are appended to the current core. When all of the candidate nodes in set N are traversed, the protein complex cores are generated.

In this study, topological property of the PPI network is measured by ECC, which is used to obtain the cells. On the basis of the ECC, we introduce a novel standard of measurement to determine the strong connection between the candidate vertex v to the core set. The closeness of a candidate node to the complex core is defined as Equation (3): $Closeness (v, Core) = m \times \frac{\sum_{u \in core} Ecc (v, u)}{\sum_{v 1, v 2 \in Core} Ecc (v 1, v 2)}$ (3)

In Equation (3), v represents the candidate node; u, v₁, and v₂ denote proteins in the current complex core; m represents the number of the proteins in the current complex core. Algorithm 3 shows the process of constructing cores (in Table 3).

2.4 Detecting attachments and optiming method

In this subsection, to form final protein complex, the core should be extended with the associated attachments. The strategy for finding attachment in Ref. [19] is simple and effective, which is employed to supplement the attachment of complex cores.

Protein complexes usually exhibit high connective rates with each other. To optimize the prediction accuracy of protein complexes, those protein complexes which densities are lower than a specific threshold should be discarded [25].

2.5 Identifying protein complexes

Algorithm 4 (in Table 4) shows the overall framework for detecting protein complexes. The possibility of high overlap between the protein complexes is decreased because of different cells. However, protein complex cores may overlap with each other [16], resulting in a small amount of protein complexes that are duplicate. Thus, the identical and low density protein complexes should be discarded and the refined protein complexes are saved to SPC set.

3 Experiments and results

3.1 Experiment data

We perform our method on two different yeast PPI networks, namely, DIP and Gavin. The DIP [26] network consists of 17,201 interactions among 4,930 proteins. The Gavin dataset [16] consists of 1,430 proteins and 6,531 interactions. The properties of the two networks are shown in Table 5. To evaluate and analyze the predicted complexes, two benchmark complexes sets are selected as the benchmark set, they are: New_MIPS [27] and CYC2008 [28]. The MIPS benchmark set consists of 428 standard protein complexes from three sources: (I) MIPS [29], (II) Aloy et al., and (III) SGD database [31] based on Gene Ontology (GO) annotations. The CYC2008 contains 408 real protein complexes. GO dataset are downloaded from Ref. [32] (update to March 1, 2014).

3.2 Evaluation measures

3.2.1 Precision, recall and F-measure

Precision, recall, and F-measure are the commonly used in information retrieval and machine learning. Prior to analyze the detailed experimental results, we first describe the corresponding evaluation metrics. The neighborhood affinity (NA) score (p, b) between a predicted complex p = (V_p, E_p) and a benchmark complex b = (V_b, E_b) in the benchmark complex set (as defined in Equation (4)) can be used to determine their match. The predicted and benchmark complex are considered matching if NA (p, b) ≥ω (ω is usually set as 0. 20). $NA (p, b) = \frac{| Vp \cap Vb |^{2}}{| Vp | \times | Vb |}$ (4) $N_{cp} = | {p | p \in P, \exists b \in B, NA (p, b) \geq ω} |$ (5) $N_{cb} = | {b | b \in B, \exists p \in P, NA (p, b) \geq ω} |$ (6)

Let P and B denote the sets of predicted protein complexes using a computational method and the actual benchmark complex, respectively. Let N_cp represents the number of predicted complexes that match at least one real complex, whereas N_cb be the number of real complexes that match at least one predicted complex. F-measure is the harmonic mean of precision and recall, which can be used to evaluate the overall performance of the different techniques [33, 34]. $F - measure = \frac{2 \times precision \times recall}{precision + recall}$ (7)

In Equation (7) $precision = \frac{N_{cp}}{| P |}$ and $recall = \frac{N_{cb}}{| B |}$ [35].

3.2.2 Coverage rate

Coverage rate is applied to evaluate the number of proteins in the real complexes that can be covered by the predicted complexes [27, 36]. Let n is the benchmark complexes and m is the predicted complexes, T_ij (i ≤ n and j ≤ m) is the number of proteins in common between the ith benchmark complex and jth predicted complex [27, 36]. The coverage rate is defined as Equation (8): $CR = \frac{\sum_{i = 1}^{| B |} max {Tij}}{\sum_{i = 1}^{| B |} Ni}$ (8)

Where, N_i is the number of proteins in the ith benchmark complex.

3.2.3 P-values (functional homogeneity)

The statistical significance of the occurrence of a protein complex with respect to a given functional annotation can be calculated by the following hypergeometric distribution in Equation (9) [25]: $P - v a l u e = 1 - \sum_{i = 0}^{k - 1} \frac{(\overset{| F |}{i}) (\overset{| V | - | F |}{| C | - i})}{(\overset{| V |}{| C |})}$ (9)

In Equation (9), a predicted complex C contains k proteins in the functional group F and the whole PPI network contains |V| proteins.

3.3 Comparison of other algorithms

We compare the performance of our method with six competing algorithms such as MCL [13], ClusterONE [1], MCODE [11], COACH [19], SPICi [22] and Core [18]. Table 6 shows the whole information of predictions on DIP and Gavin datasets.

From Table 6, MCL predicted 1246 complexes, of which 212 match 256 real complexes. ClusterONE detected 343 complexes, of which 119 match 157 real complexes, whereas MCODE identified 60 complexes, of which 31 match 61 real complexes. SPICi predicted 583 complexes, of which 132 match 209 real complexes. The COACH method detected 746 complexes, of which 285 match 249 real complexes and Core discovered 1722 ones, of which 221 match 256 real complexes. Our method predicted 433 protein complexes, of which 234 match 221 real complexes in the benchmark. With DIP data in CYC2008, MCL found 1246 complexes, of which 254 match 255 real complexes. In predicted complexes by ClusterONE, MCODE, SPICi, COACH and Core algorithm, there are 146, 34, 153, 311 and 255 predicted complexes match 134, 39, 177, 215 and 251 real complexes, respectively. Our algorithm discovered 433 complexes, of which 250 match 174 real complexes. The same analysis can be used in Gavin datasets.

Figure 2 shows the overall comparison of F-measure and coverage rate in DIP and Gavin datasets using New_MIPS and CYC2008, respectively. As shown in Fig. 2(A) (left), F-measure of our proposed method is 52.8%, which is 26.3%, 17.1%, 30.5%, 21.9%, 6.7% and 31.7% higher than MCL, ClusterONE, MCODE, SPICi, COACH and Core in DIP data, respectively. Our method also obtains the highest coverage rate of 39.0%, which is 6.9%, 16.1%, 19.4%, 9.2%, 4.1% and 3.4% higher than MCL, ClusterONE, MCODE, SPICi, COACH and Core in DIP data, respectively. The Fig. 2(A) (right) demonstrates that our method achieves the best F-measure and coverage rate matching CYC2008 benchmark complexes set. In Fig. 2(B) (left), our algorithm outperform all algorithms except for MCL, we clearly see that tour method is slightly higher than MCL in coverage rate (1.9%). In F-measure, our method is higher (3.9%, 6.3%, 16.3%, 4.3%, 0.5% and 12.3%) than MCL, ClusterONE, MCODE, SPICi, COACH and Core respectively.

Similarity, in right Fig. 2(B), our algorithm also achieves the best performance on the F-measure and coverage rate. In addition all algorithms achieved better performance in Gavin data than those in the DIP, one possible reason is that the average numbers of interactions for each protein are 5.29, 10.62 in DIP and Gavin datasets [37], respectively, which means Gavin data is denser than DIP data. From above observations we can draw a conclusion that our method outperformed achieved perfect performance in all data sets in term of most considered measures.

We also evaluated the performances of all algorithms in term of matching-rate NA [38]. Tables 7 and 8 present the evaluation results in DIP and Gavin dataset using the New_MIPS and CYC2008 benchmark data set, respectively. The ”Total” denotes the number of total complexes discovered by each approach In Table 7, COACH detected most complexes matching with real complexes when ω= 0.2. When ω∈ [0.6, 0.8], our algorithm identified 82, 57 and 34 protein complexes matched with New_MIPS in DIP data, which is higher than MCL, ClusterONE, MCODE, SPICi, COACH and Core. From Table 8, we could find COACH detected comparable number complexes matching real complexes with our method when ω= 0.2, and our method predicted more matching protein complexes than other algorithms when ω∈ [0.6, 0.8]. These observations further demonstrate the validity of our method.

Figure 3 shows how the various methods detect the SAGA complex [39]. In Fig. 3, Fig. I represent the real SAGA complex in the New_MIPS benchmark, which contains 20 proteins, and Fig. II–VI exhibits the set of proteins detected by different methods from SAGA complex. Red nodes represent the proteins matched proteins in the real SAGA complex and blue ones represent false matched proteins. Our method can cover more proteins in the real SAGA complex than other methods. There are 17 proteins were predicted by our method, of which 16 proteins can match with the real ones in benchmark. COACH, ClusterONE, SPICi, MCL and Core can only cover 11,9, 8, 6 and 2 proteins of the real SAGA complex, respectively.

For Core, there are only 2 proteins matched with real complex, which is far fewer than other algorithms, so the corresponding protein set is not presented in our study.

3.4 P-value comparison

We also calculate the p-value of complexes predicted by various algorithms. The functional homogeneity of p-value represents the probability of co-occurrence of common functions proteins. The proteins in same protein complex are more like to perform common biological functions, thus they are expected to share common functions [19]. Therefore, a predicted complex with a low p-value denotes the low probability of the collective occurrence of these proteins in the complex, and thus achieves high statistical significance. In our experiments, the p-values of protein complexes are computed with the SGD’s GO::Term Finder [40]. Generally, we consider a complex that is biologically significant, when its p-value is less than 0.01 [41].

Table 9 shows the comparison results of the biological significance in DIP and Gavin datasets. In DIP dataset, 88.5% of complexes detected by our method are significant, and our method also achieved the highest proportion of significant complexes, which is 63.4%, 16.8%, 0.2%, 30.2%, 5.1% and 72.4% higher than MCL, ClusterONE, MCODE, SPICi COACH and Core, respectively. The results for Gavin dataset are similar to those in DIP dataset. From Table 9, we can analyzes our method is more efficient for identifying significant protein complexes than other algorithms. Table 10 shows some examples with lowest p-values are predicted by our method in DIP and Gavin data, respectively. The “NA” refers to the NA scores between predicted complexes and real complexes. The ”Common protein” means the number of proteins in the real complexes that are correctly covered by predicted complexes.

3.5 Biological similarity

We further analyze the biological significances of the complex cores using the GO annotation data (BP, biological process; CC, cellular component, and MF, molecular function). Pair-wise protein interactions can have similar scores based on their GO terms. In our experiments, we calculate the functional similarity between two proteins with the method in Refs. [41, 42].

Table 11 shows the average similarity of different interactions using three sub-ontologies of GO (BP,CC and MF), and the rightmost column represents the average similarity of the three sub-ontologies. In the DIP data, the average similarity of cells in BP, CC and MF is 0.285, 0.414 and 0.226, respectively. We achieved the average of three sub-ontologies of 0.308. From Table 11, we can see that the average similarity of whole network is lower than those of cells, complex cores and complexes, the similarities of interactions in cells and cores are higher than those in complexes. These findings indicate that cores have higher biological significance than the protein complex and the whole DIP network. It further demonstrates the cells are of important meaning.

3.6 Effects of the parameters

3.6.1 The closeness threshold of k and edge clustering coefficient of m

To investigate the effects of k on the performance of our method, we first study how the algorithm behaves in terms of F-measure and coverage rate. The detailed experimental results are presented in Fig. 4, with the gradual increase in closeness threshold of k, the F-measure shows the stable trend. Interestingly, the coverage rate shows the slight decrease trend with the addition of k.

Figure 5 represents the distribution of ECC (left) and closeness (right) in DIP dataset. We find that for ECC and closeness, a majority of value fall into the interval of [0, 0.1], which account for 60%, 57.9%, respectively. Especially, with the increasing of closeness, the proportion of which is sharply decreasing, it greatly affects the quality of predicted protein complexes. As a result, we set the closeness threshold k = 0.1 in our study.

3.6.2 The density threshold of t

To further analyze the F-measure and coverage rate, we calculate the distribution of the both measures under different density threshold using DIP data set, which is represented in Fig. 6. We clearly see that F-measure exhibits the stable trend when t ∈ [0.1, 0.4], with the increase of density threshold, F-measure gradual decreases when t >0.4. Another condition is that the coverage rate show the stable trend with the increase from 0.1 to 0.2 of t, when t >0.2, the coverage rate always displays the decreasing trend. Therefore, we set t = 0.2 in our experiment.

4 Conclusion

In this study, a novel algorithm using the core-attachment framework, is proposed to identify the protein complex in the PPI networks. This proposed algorithm largely consists of three steps: First, based on the ECC, we achieve the invariant cells as the heart of the cores. Second, we identify the complex cores from the neighborhood graphs of each cell. Finally, we generate protein complexes by adding attachments into the complex cores. In protein complexes identification process, our algorithm considers the topological information of the PPI network and the inherent organization of the protein complexes.

Comprehensive comparisons among the state-of-the-art algorithms and our method have been made in terms of the number of detected protein complexes, biological significance of protein complexes on two yeast PPI networks. Experimental results show that the proposed method achieved higher overall performance than the classic computational methods. Moreover, the cells and cores of predicted protein complexes have high functional homogeneity and biological similarity. It verifies that the cells are the core of biological cores of protein complexes.

However, only few protein complexes can be discovered in some a sparse network, which affects the whole performance of our algorithm. Therefore, our method further need to be improved due to the following reasons: i) Due to the high false-positive and false-negative rates of PPI data produced by high-through experimental techniques, which affects the accuracy of protein complexes identification, limiting the identification of real protein complexes. ii) Our approach focused on the prediction of non-overlapping protein complexes, which fail to consider the condition that a protein may have multiple functions and it may belong to more than one protein complex.

As a future study, we will investigate how to reduce noise of PPI network and improve the reliability of PPI data using some popular methods, such as gene expression, gene ontology similarity, network topology feature, and so on. Since our method mainly detected non-overlapping protein complexes, we also will further improve our method on identification of overlapping protein complexes in PPI networks. In addition, how to exploit our method to find biological meaning functional modules in human PPI network or transcriptional regulatory network is another issue in future researches.

Footnotes

Acknowledgments

This work is supported by National Natural Science Foundation of China (Grant no. 61572180) and Hunan Provincial Natural Science Foundation of China (Grant no. 13JJ2017).

References

Nepusz

, Yu

H.Y.

and Paccanaro

, Detecting overlapping protein complexes in protein-protein interaction networks, Nature Methods9 (2012), 471–U81.

Zhu

X.Y.

, Li

K.L.

and Salah

, A data parallel strategy for aligning multiple biological sequences on multi-core computers, Computers in Biology and Medicine43 (2013), 350–361.

Zhu

X.Y.

, Li

K.L.

, Salah

, Shi

and Li

K.Q.

, Parallel implementation of MAFFT on CUDA-enabled graphics hardware, IEEE-Acm Transactions on Computational Biology and Bioinformatics12 (2015), 205–218.

Ahmad

K.L.

, Salah, PAR-3D-BLAST: A parallel tool for searching and aligning protein structures, Concurrency and Computation: Practice And Experience26 (2013), 1705–1714.

Liang

, Luo

J.W.

and Song

, Network simulation reveals significant contribution of network motifs to the age-dependency of yeast protein-protein interaction networks, Molecular Biosystems10 (2014), 2277–2288.

, Gruhler

, Heilbut

, Bader

G.D.

, et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry, Nature415 (2002), 180–183.

Wong

K.C.

, Peng

C.B.

, Li

and Chan

T.M.

, Herd Clustering: A synergistic data clustering approach using collective intelligence, Applied Soft Computing23 (2014), 61–75.

Chengbin Peng

Z.Z.

, Wong

K.-C

, et al., A Scalable Community Detection Algorithm for Large Graphs Using Stochastic Block Models, Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), 2015, pp. 2090–2096.

Cramer

, Bushnell

D.A.

, Fu

J.H.

, et al., Architecture of RNA polymerase II and implications for the transcription mechanism, Science288 (2000), 640–649.

10.

Fahrenkrog

and Aebi

, The nuclear pore complex: Nucleocytoplasmic transport and beyond, Nature Reviews Molecular Cell Biology4 (2003), 757–766.

11.

Bader

G.D.

and Hogue

C.W.

, An automated method for finding molecular complexes in large protein interaction networks, Bmc Bioinformatics4 (2003).

12.

Adamcsek

, Palla

, Farkas

I.J.

, et al., CFinder: Locating cliques and overlapping modules in biological networks, Bioinformatics22 (2006), 1021–1023.

13.

Pereira-Leal

J.B.

, Enright

A.J.

and Ouzounis

C.A.

, Detection of functional modules from protein interaction networks, Proteins-Structure Function and Genetics54 (2004), 49–57.

14.

Satuluri

and Parthasarathy

, Scalable graph clustering using stochastic flows: Applications to community discovery, Kdd-09:15th Acm Sigkdd Conference on Knowledge Discovery and Data Mining (2009), 737–745.

15.

Shih

Y.K.

and Parthasarathy

, Identifying functional modules in interaction networks through overlapping Markov clustering, Bioinformatics28 (2012), I473–I479.

16.

Gavin

A.C.

, Aloy

, Grandi

, et al., Proteome survey reveals modularity of the yeast cell machinery, Nature440 (2006), 631–636.

17.

Dezso

, Oltvai

Z.N.

and Barabasi

A.L.

, Bioinformatics analysis of experimentally determined protein complexes in the yeast Saccharomyces cerevisiae, Genome Research13 (2003), 2450–2454.

18.

Leung

H.C.M.

, Xiang

, Yiu

S.M.

and Chin

F.Y.L.

, Predicting protein complexes from PPI data: A core-attachment approach, Journal of Computational Biology16 (2009), 133–144.

19.

, Li

X.L.

, Kwoh

C.K.

and Ng

S.K.

, A core-attachment based method to detect protein complexes in PPI networks, Bmc Bioinformatics10 (2009), 1–16.

20.

Buwen Cao

J.L.

, Liang

, Wang

and Song

, MOEPGA: A novel method to detect protein complexes in yeast protein-protein interaction network based on MultiObjecitve Evolutionary Programming Genetic Algorithm, Computational Biology and Chemistry58 (2015), 173–181.

21.

Takemoto

, Oosawa

and Akutsu

, Structure of n-clique networks embedded in a complex network, Physica a-Statistical Mechanics and Its Applications380 (2007), 665–672.

22.

Jiang

and Singh

, SPICi: A fast clustering algorithm for large biological networks, Bioinformatics26 (2010), 1105–1111.

23.

Radicchi

, Castellano

, Cecconi

, Loreto

and Parisi

, Defining and identifying communities in networks, Proceedings of the National Academy of Sciences of the United States of America101 (2004), 2658–2663.

24.

Wang

, Li

, Wang

J.X.

and Pan

, A new method for identifying essential proteins based on edge clustering coefficient, Bioinformatics Research and Applications6674 (2011), 87–98.

25.

King

A.D.

, Przulj

and Jurisica

, Protein complex prediction via cost-based clustering, Bioinformatics20 (2004), 3013–3020.

26.

Xenarios

, Salwinski

, Duan

X.Q.J.

, Higney

, Kim

S.M.

and Eisenberg

, DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions, Nucleic Acids Research30 (2002), 303–305.

27.

Friedel

C.C.

, Krumsiek

and Zimmer

, Bootstrapping the interactome: Unsupervised identification of protein complexes in yeast, Research in Computational Molecular Biology, Proceedings4955 (2008), 3–16.

28.

S.Y.

, Wong

, Turner

, Cho

and Wodak

S.J.

, Up-to-date catalogues of yeast protein complexes, Nucleic Acids Research37 (2009), 825–831.

29.

Mewes

H.W.

, Amid

, Arnold

, et al., MIPS: Analysis and annotation of proteins from whole genomes, Nucleic Acids Research32 (2004), D41–D44.

30.

Aloy

, Bottcher

, et al., Structure-based assembly of protein complexes in yeast, Science303 (2004), 2026–2029.

31.

Dwight

S.S.

, Harris

M.A.

, Dolinski

, et al., Saccharomyces Genome Database (SGD), provides secondary gene annotation using the Gene Ontology (GO), Nucleic Acids Research30 (2002), 69–72.

32.

GeneOntology Database, Http://www.geneontology.org/GO.database.shtml (2014).

33.

Zhang

, Park

B.H.

, Karpinets

and Samatova

N.F.

, From pull-down data to protein interaction networks and complexes with biological relevance, Bioinformatics24 (2008), 979–986.

34.

X.L.

, Foo

C.S.

and Ng

S.K.

, Discovering protein complexes in dense reliable neighborhoods of protein interaction networks, Comput Syst Bioinformatics Conf6 (2007), 157–168.

35.

Chua

H.N.

, Ning

, Sung

W.K.

, Leong

H.W.

and Wong

, Using indirect protein-protein interactions for protein complex prediction, Journal of Bioinformatics and Computational Biology6 (2008), 97–109.

36.

Brohee

and van

, Helden, Evaluation of clustering algorithms for protein-protein interaction networks, Bmc Bioinformatics7 (2006), 1–19.

37.

X.L.

, Wu

, Kwoh

C.K.

and Ng

S.K.

, Computational approaches for detecting protein complexes from protein interaction networks: A survey, Bmc Genomics11(2010), Suppl1:S3.

38.

X.K.

and Gao

, Discovering protein complexes in protein interaction networks via exploring the weak ties effect, Bmc Systems Biology6 (2012), Suppl1:S6.

39.

Grant

P.A.

, Schieltz

, Pray-Grant

M.G.

, Steger

D.J.

, Reese

J.C.

, YatesJ.R. III and Workman

J.L

, A subset of TAFIIs are integral components of the SAGA complex required for nucleosome acetylation and transcriptional stimulation, Cell94 (1998), 45–53.

40.

Boyle

E.I.

, Weng

S.A.

, Gollub

, Jin

, Botstein

, Cherry

J.M.

and Sherlock

, GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes, Bioinformatics20 (2004), 3710–3715.

41.

H.W.

, Su

Z.C.

, Mao

F.L.

, Olman

and Xu

, Prediction of functional modules based on comparative genome analysis and Gene Ontology application, Nucleic Acids Research33 (2005), 2822–2837.

42.

, An information-theoretic definition of similarity, International Conference on Machine Learning98 (1998), 296–304.