Abstract
With the increasing of available protein-protein interaction (PPI) data, many computational methods have been explored to identify protein complexes from PPI networks. Majority of algorithms employ the feature of local neighbors to detect local dense subgraphs which correspond to protein complexes. Those approaches neglect the inherent core-attachment structure of protein complexes, which to an extent affect the protein complexes of prediction accuracy. In this paper, we propose a new algorithm for predicting protein complexes, deriving from the framework of the core-attachment. The proposed method first obtains the triangular structures of the core of protein complexes, name as cells, in which the edge-clustering coefficient is used. And then the cells are expanded to protein complex cores based on the closeness. Finally, the attachments are added to their corresponding cores to form the final protein complexes. The experimental results on two yeast PPI data show our method outperform the existing algorithms in terms of matched protein complexes and biological significance using two benchmark data sets.
Introduction
In the past several years, large-scale high-through experimental data have been produced from different organisms [1], which inspires the researchers’ interesting. For example, Li et al. employed the parallel strategy to align multiple biological sequences [2, 3] and protein structures [4]. Liang et al. [5] revealed the evolving processes of yeast protein-protein interaction (PPI) network by network motifs to the age-dependency. More importantly, identification of protein complexes contributes to our understanding of the organization and interaction of cellular processes, which is useful in system-level understanding of biological organization.With the increasing of biological experimental data, many methods are developed to detect protein complexes [6–8], which play an important role in performing a special cellular functions. For example, the RNA polymerase II complex transcribes genetic information into messages for ribosomes to produce proteins [9]. The nuclear pore complex protects the exchange of components between the nucleus and cytoplasm, preventing the transport of material that unintentionally crosses the nuclear membrane [10].
Recently, numerous computational methods are used to systematically elucidate and analyze the large-scale PPI network, in which a vertex represents a protein and an edge denotes the interaction between a pair of proteins. These computational methods are based on the assumption that proteins in one complex have more interactions among themselves than others. Generally, majority of Computational methods mainly focus on detecting highly connected subgraphs as protein complexes in the PPI networks [11–15]. Molecular complex detection (MCODE) [11] starts with high weighted protein, where is calculated by the density of their local neighborhoods. It predicts protein complexes by extending seed protein. CFinder [12] utilizes the cliques to define a dense subgraph, which is discovered by merging a set of adjacent k-cliques if they share (k - 1) common nodes. MCL [13] proposed the concept of random “walks” to predict protein complexes. An efficient and robust variation of MCL, known as R-MCL [14], was proposed in 2010; R-MCL regularized operation and balances parameter to refine the precision of detected protein complexes. ‘Soft’ R-MCL (SR-MCL) [15] is another algorithm that can produce overlapping protein complexes.
Although these methods can effectively predict protein complexes in PPI network, the inherent architecture of complexes are often neglected. Previous studies [16] uncovers a protein complex is generally composed of a core and attachments. A protein core is a small group of proteins with highly connected each other and high functional similarity. The core is the key functional unit of the complex and largely determines its cellular function and significance [16, 17]. Based on the core-attachment structure of protein complex, Leung et al. [18] proposed the CORE algorithm that predicts protein-complex cores based on a statistical framework. This approach used topology information of the PPI network to determine if two proteins are in the same protein-complex core. Wu et al. [19] introduced the COACH algorithm derived from the structure of core-attachment, which defined the core vertices among the neighborhood graphs, and then identified the preliminary cores of the protein complexes. This method added the attachments into the complex cores to form protein complexes. Cao et al. [20] proposed a novel multiObjective evolutionary programming genetic algorithm which integrates multiple network topological features to detect biologically meaningful protein complexes. It can serve as a powerful framework to detect protein complexes, which experimental results outperform those methods with single network topological characteristic.
In this study, motivated by interactive features among the nodes in communication network, we propose a novel algorithm based on the core-attachment method. In the core-attachment framework, a smaller stable core of core, named as cell, may reasonably be included into the cores of protein complexes. Considering the invariance of the statistical properties of triangular structure in the scale-free networks, including PPI network [21], our algorithm mainly consists of three steps: first, our method obtains the stable cells as the heart of the cores by using the edge-clustering coefficient (ECC). Second, these cells are extended to the complex cores by filtering the candidate set. In this process, we propose a novel measure rule to quantify the relationship between the candidate nodes and the complex cores. Finally, we add attachments into the complex cores to generate final protein complexes. To evaluate the efficiency and quality of our method, we apply it to two yeast PPI network, and verify our results using two public datasets of reference protein complexes. We compare our approaches to several state-of-the-art algorithms such as MCL [13], ClusterONE [1], MCODE [11], SPICi [22], COACH [19] and Core [18]. The experimental results show that our method achieves higher accuracy than the existing computational methods. Moreover, the cells and cores of the predicted protein complexes show high functional homogeneity and biological similarity. Figure 1 presents the framework of our study.
Methods
A PPI network can be modeled as an unweighted undirected graph G = (V, E), in which the vertex set V represents the proteins and the edge set E represents the set of interactions between pairs of proteins.
Edge-clustering coefficient (ECC)
ECC characterizes the “closeness” between the two connected nodes and other nodes around
the edge. Radicchi et al. [23] first proposed
the clustering coefficient of the edge in a complex network. According to the ECC, we can
measure the topological property of the PPI network. Higher ECC indicates the edge is more
important in the network. The ECC is defined as Equation (1) [23]:
In Equation (1), Zuv is the number of triangles built on the edge, du and dv represents the degrees of nodes u and v in the network, respectively. From Equation (1), we can infer that the value of ECC is between 0 and 1.
The importance of nodes can be quantified by the sum of ECC (SoECC) of the adjacent
edges,which is presented as Equation (2)
[24], where Nu is the neighbor
protein set of u.
SoECC quantifies the importance of nodes in the PPI network, which has been widely used in predicting essential proteins. Wang et al. [24] used the ECC to identify the essential proteins in the PPI network. Here, it is used to predict protein complexes in PPI network. Obviously, low SoECC indicates less important nodes. According to Equations (1) and (2), SoECC (u) = 0 means there is no link from other nodes to u. Similarity, ECC (u, v) = 0 means there is no common node connects with node u and node v. From the analysis, we can infer that the node u is impossible to be a part of a cell if SoECC (u) is 0. Therefore, we filter these nodes with SoECC = 0, and then save the remaining nodes in SN using Algorithm 1, which is described in Table 1.
A triangular structure is composed of three nodes and three edges. In the scale-free network, the inherent statistical properties of triangular structure are relatively solid [21]. Interestingly, the stable structure is also widely used in microcosmic concepts, including chemical bonds and molecular structure. In this study, we regard the triangular structure as the center of core.
It is note that, the triangular structures, which are composed of three nodes with high SoECC and the corresponding edges, are achieved by ranking the SoECC value. As cell, the special structures are of key meanings for protein complex cores because of its stability and expansibility. Algorithm 2 gives the detailed description (in Table 2). In algorithm 2, the SN set is achieved from algorithm 1, and then we construct a cell containing three nodes with the highest SoECC values and their corresponding edges, the CELL is added into the set of cells, named SC. To achieve non-overlapping cells in SC, we remove the three nodes in cell from SN set, until the size of SN is less than 3, finally the SC set is outputted.
Constructing protein complex cores
The cells are the primary part of the complex cores. Therefore, we need to expand the cells to form the complex cores. We place those direct neighbors of the proteins in the cells into set N to form the set of candidate cores, and descending sort the set by SoECC values. We then compute the closeness between each candidate and the core. Finally, the candidates with closeness higher than the threshold are appended to the current core. When all of the candidate nodes in set N are traversed, the protein complex cores are generated.
In this study, topological property of the PPI network is measured by ECC, which is used
to obtain the cells. On the basis of the ECC, we introduce a novel standard of measurement
to determine the strong connection between the candidate vertex v to the core set. The
closeness of a candidate node to the complex core is defined as Equation (3):
In Equation (3), v represents the candidate node; u, v1, and v2 denote proteins in the current complex core; m represents the number of the proteins in the current complex core. Algorithm 3 shows the process of constructing cores (in Table 3).
In this subsection, to form final protein complex, the core should be extended with the associated attachments. The strategy for finding attachment in Ref. [19] is simple and effective, which is employed to supplement the attachment of complex cores.
Protein complexes usually exhibit high connective rates with each other. To optimize the prediction accuracy of protein complexes, those protein complexes which densities are lower than a specific threshold should be discarded [25].
Identifying protein complexes
Algorithm 4 (in Table 4) shows the overall framework for detecting protein complexes. The possibility of high overlap between the protein complexes is decreased because of different cells. However, protein complex cores may overlap with each other [16], resulting in a small amount of protein complexes that are duplicate. Thus, the identical and low density protein complexes should be discarded and the refined protein complexes are saved to SPC set.
Experiments and results
Experiment data
We perform our method on two different yeast PPI networks, namely, DIP and Gavin. The DIP [26] network consists of 17,201 interactions among 4,930 proteins. The Gavin dataset [16] consists of 1,430 proteins and 6,531 interactions. The properties of the two networks are shown in Table 5. To evaluate and analyze the predicted complexes, two benchmark complexes sets are selected as the benchmark set, they are: New_MIPS [27] and CYC2008 [28]. The MIPS benchmark set consists of 428 standard protein complexes from three sources: (I) MIPS [29], (II) Aloy et al., and (III) SGD database [31] based on Gene Ontology (GO) annotations. The CYC2008 contains 408 real protein complexes. GO dataset are downloaded from Ref. [32] (update to March 1, 2014).
Evaluation measures
Precision, recall and F-measure
Precision, recall, and F-measure are the commonly used in information retrieval and
machine learning. Prior to analyze the detailed experimental results, we first describe
the corresponding evaluation metrics. The neighborhood affinity (NA) score (p, b)
between a predicted complex p = (Vp, Ep) and a benchmark complex
b = (Vb, Eb) in the benchmark complex set (as defined in Equation (4)) can be used to determine their
match. The predicted and benchmark complex are considered matching if NA (p, b)
≥ω (ω is usually set as 0. 20).
Let P and B denote the sets of predicted protein complexes using a computational method
and the actual benchmark complex, respectively. Let Ncp represents the number
of predicted complexes that match at least one real complex, whereas Ncb be
the number of real complexes that match at least one predicted complex. F-measure is the
harmonic mean of precision and recall, which can be used to evaluate the overall
performance of the different techniques [33,
34].
In Equation (7) and [35].
Coverage rate is applied to evaluate the number of proteins in the real complexes that
can be covered by the predicted complexes [27,
36]. Let n is the benchmark
complexes and m is the predicted complexes,
T
i
j
(i ≤ n and j ≤ m)
is the number of proteins in common between the ith benchmark complex
and jth predicted complex [27, 36]. The coverage rate is defined
as Equation (8):
Where, N i is the number of proteins in the ith benchmark complex.
The statistical significance of the occurrence of a protein complex with respect to a
given functional annotation can be calculated by the following hypergeometric
distribution in Equation (9) [25]:
In Equation (9), a predicted complex C contains k proteins in the functional group F and the whole PPI network contains |V| proteins.
We compare the performance of our method with six competing algorithms such as MCL [13], ClusterONE [1], MCODE [11], COACH [19], SPICi [22] and Core [18]. Table 6 shows the whole information of predictions on DIP and Gavin datasets.
From Table 6, MCL predicted 1246 complexes, of which 212 match 256 real complexes. ClusterONE detected 343 complexes, of which 119 match 157 real complexes, whereas MCODE identified 60 complexes, of which 31 match 61 real complexes. SPICi predicted 583 complexes, of which 132 match 209 real complexes. The COACH method detected 746 complexes, of which 285 match 249 real complexes and Core discovered 1722 ones, of which 221 match 256 real complexes. Our method predicted 433 protein complexes, of which 234 match 221 real complexes in the benchmark. With DIP data in CYC2008, MCL found 1246 complexes, of which 254 match 255 real complexes. In predicted complexes by ClusterONE, MCODE, SPICi, COACH and Core algorithm, there are 146, 34, 153, 311 and 255 predicted complexes match 134, 39, 177, 215 and 251 real complexes, respectively. Our algorithm discovered 433 complexes, of which 250 match 174 real complexes. The same analysis can be used in Gavin datasets.
Figure 2 shows the overall comparison of F-measure and coverage rate in DIP and Gavin datasets using New_MIPS and CYC2008, respectively. As shown in Fig. 2(A) (left), F-measure of our proposed method is 52.8%, which is 26.3%, 17.1%, 30.5%, 21.9%, 6.7% and 31.7% higher than MCL, ClusterONE, MCODE, SPICi, COACH and Core in DIP data, respectively. Our method also obtains the highest coverage rate of 39.0%, which is 6.9%, 16.1%, 19.4%, 9.2%, 4.1% and 3.4% higher than MCL, ClusterONE, MCODE, SPICi, COACH and Core in DIP data, respectively. The Fig. 2(A) (right) demonstrates that our method achieves the best F-measure and coverage rate matching CYC2008 benchmark complexes set. In Fig. 2(B) (left), our algorithm outperform all algorithms except for MCL, we clearly see that tour method is slightly higher than MCL in coverage rate (1.9%). In F-measure, our method is higher (3.9%, 6.3%, 16.3%, 4.3%, 0.5% and 12.3%) than MCL, ClusterONE, MCODE, SPICi, COACH and Core respectively.
Similarity, in right Fig. 2(B), our algorithm also achieves the best performance on the F-measure and coverage rate. In addition all algorithms achieved better performance in Gavin data than those in the DIP, one possible reason is that the average numbers of interactions for each protein are 5.29, 10.62 in DIP and Gavin datasets [37], respectively, which means Gavin data is denser than DIP data. From above observations we can draw a conclusion that our method outperformed achieved perfect performance in all data sets in term of most considered measures.
We also evaluated the performances of all algorithms in term of matching-rate NA [38]. Tables 7 and 8 present the evaluation results in DIP and Gavin dataset using the New_MIPS and CYC2008 benchmark data set, respectively. The ”Total” denotes the number of total complexes discovered by each approach In Table 7, COACH detected most complexes matching with real complexes when ω= 0.2. When ω∈ [0.6, 0.8], our algorithm identified 82, 57 and 34 protein complexes matched with New_MIPS in DIP data, which is higher than MCL, ClusterONE, MCODE, SPICi, COACH and Core. From Table 8, we could find COACH detected comparable number complexes matching real complexes with our method when ω= 0.2, and our method predicted more matching protein complexes than other algorithms when ω∈ [0.6, 0.8]. These observations further demonstrate the validity of our method.
Figure 3 shows how the various methods detect the SAGA complex [39]. In Fig. 3, Fig. I represent the real SAGA complex in the New_MIPS benchmark, which contains 20 proteins, and Fig. II–VI exhibits the set of proteins detected by different methods from SAGA complex. Red nodes represent the proteins matched proteins in the real SAGA complex and blue ones represent false matched proteins. Our method can cover more proteins in the real SAGA complex than other methods. There are 17 proteins were predicted by our method, of which 16 proteins can match with the real ones in benchmark. COACH, ClusterONE, SPICi, MCL and Core can only cover 11,9, 8, 6 and 2 proteins of the real SAGA complex, respectively.
For Core, there are only 2 proteins matched with real complex, which is far fewer than other algorithms, so the corresponding protein set is not presented in our study.
P-value comparison
We also calculate the p-value of complexes predicted by various algorithms. The functional homogeneity of p-value represents the probability of co-occurrence of common functions proteins. The proteins in same protein complex are more like to perform common biological functions, thus they are expected to share common functions [19]. Therefore, a predicted complex with a low p-value denotes the low probability of the collective occurrence of these proteins in the complex, and thus achieves high statistical significance. In our experiments, the p-values of protein complexes are computed with the SGD’s GO::Term Finder [40]. Generally, we consider a complex that is biologically significant, when its p-value is less than 0.01 [41].
Table 9 shows the comparison results of the biological significance in DIP and Gavin datasets. In DIP dataset, 88.5% of complexes detected by our method are significant, and our method also achieved the highest proportion of significant complexes, which is 63.4%, 16.8%, 0.2%, 30.2%, 5.1% and 72.4% higher than MCL, ClusterONE, MCODE, SPICi COACH and Core, respectively. The results for Gavin dataset are similar to those in DIP dataset. From Table 9, we can analyzes our method is more efficient for identifying significant protein complexes than other algorithms. Table 10 shows some examples with lowest p-values are predicted by our method in DIP and Gavin data, respectively. The “NA” refers to the NA scores between predicted complexes and real complexes. The ”Common protein” means the number of proteins in the real complexes that are correctly covered by predicted complexes.
Biological similarity
We further analyze the biological significances of the complex cores using the GO annotation data (BP, biological process; CC, cellular component, and MF, molecular function). Pair-wise protein interactions can have similar scores based on their GO terms. In our experiments, we calculate the functional similarity between two proteins with the method in Refs. [41, 42].
Table 11 shows the average similarity of different interactions using three sub-ontologies of GO (BP,CC and MF), and the rightmost column represents the average similarity of the three sub-ontologies. In the DIP data, the average similarity of cells in BP, CC and MF is 0.285, 0.414 and 0.226, respectively. We achieved the average of three sub-ontologies of 0.308. From Table 11, we can see that the average similarity of whole network is lower than those of cells, complex cores and complexes, the similarities of interactions in cells and cores are higher than those in complexes. These findings indicate that cores have higher biological significance than the protein complex and the whole DIP network. It further demonstrates the cells are of important meaning.
Effects of the parameters
The closeness threshold of k and edge clustering coefficient of m
To investigate the effects of k on the performance of our method, we first study how the algorithm behaves in terms of F-measure and coverage rate. The detailed experimental results are presented in Fig. 4, with the gradual increase in closeness threshold of k, the F-measure shows the stable trend. Interestingly, the coverage rate shows the slight decrease trend with the addition of k.
Figure 5 represents the distribution of ECC (left) and closeness (right) in DIP dataset. We find that for ECC and closeness, a majority of value fall into the interval of [0, 0.1], which account for 60%, 57.9%, respectively. Especially, with the increasing of closeness, the proportion of which is sharply decreasing, it greatly affects the quality of predicted protein complexes. As a result, we set the closeness threshold k = 0.1 in our study.
The density threshold of t
To further analyze the F-measure and coverage rate, we calculate the distribution of the both measures under different density threshold using DIP data set, which is represented in Fig. 6. We clearly see that F-measure exhibits the stable trend when t ∈ [0.1, 0.4], with the increase of density threshold, F-measure gradual decreases when t >0.4. Another condition is that the coverage rate show the stable trend with the increase from 0.1 to 0.2 of t, when t >0.2, the coverage rate always displays the decreasing trend. Therefore, we set t = 0.2 in our experiment.
Conclusion
In this study, a novel algorithm using the core-attachment framework, is proposed to identify the protein complex in the PPI networks. This proposed algorithm largely consists of three steps: First, based on the ECC, we achieve the invariant cells as the heart of the cores. Second, we identify the complex cores from the neighborhood graphs of each cell. Finally, we generate protein complexes by adding attachments into the complex cores. In protein complexes identification process, our algorithm considers the topological information of the PPI network and the inherent organization of the protein complexes.
Comprehensive comparisons among the state-of-the-art algorithms and our method have been made in terms of the number of detected protein complexes, biological significance of protein complexes on two yeast PPI networks. Experimental results show that the proposed method achieved higher overall performance than the classic computational methods. Moreover, the cells and cores of predicted protein complexes have high functional homogeneity and biological similarity. It verifies that the cells are the core of biological cores of protein complexes.
However, only few protein complexes can be discovered in some a sparse network, which affects the whole performance of our algorithm. Therefore, our method further need to be improved due to the following reasons: i) Due to the high false-positive and false-negative rates of PPI data produced by high-through experimental techniques, which affects the accuracy of protein complexes identification, limiting the identification of real protein complexes. ii) Our approach focused on the prediction of non-overlapping protein complexes, which fail to consider the condition that a protein may have multiple functions and it may belong to more than one protein complex.
As a future study, we will investigate how to reduce noise of PPI network and improve the reliability of PPI data using some popular methods, such as gene expression, gene ontology similarity, network topology feature, and so on. Since our method mainly detected non-overlapping protein complexes, we also will further improve our method on identification of overlapping protein complexes in PPI networks. In addition, how to exploit our method to find biological meaning functional modules in human PPI network or transcriptional regulatory network is another issue in future researches.
Footnotes
Acknowledgments
This work is supported by National Natural Science Foundation of China (Grant no. 61572180) and Hunan Provincial Natural Science Foundation of China (Grant no. 13JJ2017).
