Abstract
Abstract
Predicting protein complexes from protein–protein interaction (PPI) network is of great significance to recognize the structure and function of cells. A protein may interact with different proteins under different time or conditions. Existing approaches only utilize static PPI network data that may lose much temporal biological information. First, this article proposed a novel method that combines gene expression data at different time points with traditional static PPI network to construct different dynamic subnetworks. Second, to further filter out the data noise, the semantic similarity based on gene ontology is regarded as the network weight together with the principal component analysis, which is introduced to deal with the weight computing by three traditional methods. Third, after building a dynamic PPI network, a predicting protein complexes algorithm based on “core-attachment” structural feature is applied to detect complexes from each dynamic subnetworks. Finally, it is revealed from the experimental results that our method proposed in this article performs well on detecting protein complexes from dynamic weighted PPI networks.
1. Introduction
P
A PPI network can be regarded as a graph that consists of vertexes and edges, representing proteins and different interactions, respectively (Shih and Parthasarathy, 2012). In the past decade, more and more computational methods of detecting protein complexes from PPI networks have been proposed (Price et al., 2013). These clustering algorithms can be divided into three categories: graph clustering-based method, local density method, and hierarchical clustering method. King et al. (2012) proposed an algorithm—restricted neighborhood search clustering (RNSC), which is based on classical graph clustering. But the experimental results of RNSC are greatly affected by the parameters. Srihari et al. (2010) proposed another typical graph cluster method—Markov Clustering, which simulates random walking in a PPI network to find protein complexes. Spirin (2004) confirms that, by analyzing the structure of PPI network, protein complexes are related to the density and the connectivity of a local module. The better connectivity of a subgraph means that it may be a complex. From the view mentioned previously, molecular complex detection (MCODE) was presented by Bader and Hogue (2003). And to improve the accuracy (Acc), DPClus (Shigehiko et al., 2006) made a change to the stopping condition for cluster formation, which replaces vertex weight with cluster density. Palla et al. (2007) presented clique percolation method to merge all full connected graphs with k-1 common nodes. Girvan and Newman (2001) provided a method, a typical presentation for hierarchical clustering, based on removing the edges with the highest betweenness.
The false positive and false negative data provided by high-throughput techniques will play a negative role during identifying protein complexes (Kouhsar et al., 2016). So how to deal with the data is also an essential problem. In the past years, many methods are proposed to eliminate the data noise by constructing the weighted PPI networks (Liu et al., 2009; Price et al., 2013). Traditional methods of computing the weight mainly regards the topology information of PPI networks as the edge weights, such as the degree of vertex, average degree, or the density of graph. However, only considering topology information will lead to ignore and lose a lot of biological information. To take the biological information into account, domain–domain interaction information has been introduced (Hayashida et al., 2011).
In the past few years, gene ontology (GO) annotations have been introduced and accepted, because it improved the accuracy of predicting. GO is composed of three domains: BP, molecular functions (MFs), and cellular components (CCs). GO and annotations of GO are often applied to compute the semantic similarity as the weight between two different proteins (Wang et al., 2011). GO is modeled as a directed acyclic graph that represents biological knowledge and the relationship among different genes or the gene product. The higher the score of semantic similarity between two different proteins is, the more possible an interaction of these two proteins will be. The mature semantic similarity methods can be grouped into four categories (Price et al., 2013): (1) path length-based methods, they compute the semantic similarity by observing the path length or the depth to the common ancestor term of different two GO terms in the ontology structure; (2) information content-based methods, the GO annotations express a lot of information about the corresponding gene or the gene product, so the more common a couple of GO annotations are, the more similar the annotations' product is; (3) common term-based methods, they measure the repeat parts between two ancestor term sets; and (4) hybrid methods, integrating the mentioned three methods, hybrid methods incorporate two or more different categories.
The algorithms of detecting protein complexes already mentioned are focused on the static networks. In fact, the relationship of any two different proteins is not immutable, which means that a protein might interact with different proteins under different conditions. It pointed out that a protein may interact with different proteins in different stages of a cell cycle to perform a completely different protein complex (Wang et al., 2013). Because the static networks lost much temporal biological information that reduces the Acc of predicting protein complexes, many researchers realized that we should pay more attention to the dynamic networks instead of the static networks (Chen et al., 2014). Combining the PPI networks data with gene expression data has become a popular approach to construct dynamic networks. The easiest method of processing expression data just chooses the average value of all proteins' gene expression value as a benchmark; if an expression value is higher than the benchmark, the protein is activated at the time point. But this method is rarely applied, because the expression values of different proteins at different time points vary vastly, in addition to that, there is also inevitable background noise associated with the expression value, so the benchmark is hard to be calculated. Another classical algorithm is three-sigma that is proposed by Wang et al. (2011). But these proteins with low expression level will be filtered out, which is the disadvantage of three-sigma.
To reveal the dynamics of PPI network and boost the Acc rate, this article mainly focuses on the following aspects:
1. Integrating expression data and static interaction data that come from high-throughput experiments to construct the dynamic PPI networks. 2. Introducing GO semantic similarity as the weight to filter out the data noise. 3. Proposing a method based on “core-attachment” structure to predict protein complexes.
2. Methods
Figure 1 shows the expression activity values among different proteins at different time points. Based on the observation from Figure 1, we can conclude that there is a significant difference among different proteins' expression values. So the important challenge of this research will be, at a particular time point, how to judge whether a protein is active and whether an interaction from the static high-throughput data is valid.

Expression values of different proteins at different time points.
The simplest way is using a global threshold to identify the active time point. If the expression value is greater than the threshold, then the protein is active at that time point. But this method does not work well in all situations, (1) because the levels of different proteins vary from protein to protein, there is no global threshold that can suit all proteins and (2) the data are generated by high-throughput experiments that may be embedded with inevitable noise. To construct dynamic networks, Wang et al. (2013) proposed a method called three-sigma to identify active time points of each point. But, the three-sigma method did not work well on the data set with low expression values, such as the YLR332W. This article improves the three-sigma method based on parameter estimation to optimize the variance and mean for each protein.
The number of expression data provided by the experiment method is much smaller than that in a real cellular cycle. So the expression data that we got can be considered as a sample of the population. What we need to do is to estimate the population's variance and mean by those samples. In statistics, we know the interval estimation is often used to calculate the population's variance or mean when the mean or the variance of sample is given. So the interval estimation is introduced to improve the three-sigma algorithm.
2.1. Determine the interval of population's mean and variance
As the expression values of proteins in the static PPI network vary with the time or environment, thus, a static PPI network PN can be broken down into several dynamic networks with smaller size expressed as

Convert static network into dynamic network at different time points.
Figure 3 illustrates that the expression values vary cyclically every 12 time points. To reduce the time complexity, the expression value of each moment spanning the 12 time points is calculated by the average of three cycles as given in Equation (1).

Periodic expression among different proteins.
where
Suppose that samples
where
Under the condition of confidence level

T-distribution function.
Based on Equations (2) and (4), Equation (5) about population's mean can be inferred as follows:
So the confidence interval of the population's mean
In a similar manner, the maximum possible interval of variance shown in Figure 5 can be written as the mathematical expression as Equation (6).

Chi-square distribution function.
Thus, combining Equations (3) and (6), we can get Equation (7).
In summary, the confidence interval of the population's variance
2.2. Improved three-sigma method
The traditional three-sigma method described as Equation (8) is likely to filter out active proteins with low expression.
in which
From Equation (8), it is observed that the active threshold depends on the mean and variance. A protein with low expression value at most time points, such as YLR331C, will be filtered out incorrectly according to this formula. To solve this problem, this article makes some improvements on
where
where Pi is a column vector about the active probability of all proteins at time point i, and
2.3. GO semantic similarity
As a measure of filtering out data noise, weighted network has become more essential. Among these proposed approaches, topological information (Li et al., 2012) is the first applied method. However, the weight of a protein interaction does not only depend on the topological structure but also relies on the biological information. Ozawa et al. (2010) and Ma et al. (2012) applied the domain interaction information into constructing a weighted PPI network and confirmed that the bioinformatics plays a pivotal role in predicting protein complexes. At the meantime, more researchers tend to look favorably on GO semantic similarity, which is a function computing closeness in meaning between terms with an ontology in the past few years. Three approaches that are Jiang and Conrath (1997), Lin (1998), and SimGIC (Lu et al., 2012) are adopted in this article to calculate the semantic similarity.
These three semantic similarity measures contribute from different perspectives: to integrate the different information and remove redundant information, principal component analysis (PCA) is introduced, which is a statistical procedure and its function is to reduce the dimension of the sample feature. The weight of an interaction i in the network computed by different methods can be represented as a vector Vi:
Some GO without annotations will influence the construction of weighted PPI network. In this article, the mean of semantic similarity values with annotation will be considered as the weight of interaction without annotation.
So far, we can build a complete network at any time point i.
where wi is the weight matrix and
2.4. Algorithm of predicting protein complexes
Dezso et al. (2003) and Gavin et al. (2006) pointed out that the structure of a protein complex can be described as a core-attachment as shown in Figure 6, where proteins A, B, C, and D form a group called core, all of the proteins E, F, G, and H are the attachments around the core. By the inspiration of this structure, a novel algorithm called DWCOACH is proposed in this article. The new method mainly contains several subfunctions, including filtering seed proteins, construction of cores, adjunction of attachment proteins, and elimination of redundant protein complexes. Algorithm 1 describes the main process of predicting protein complexes. The static PPI network is regarded as the input, and the protein complexes are the output. Line 3 is to deal with the expression value by using the three-sigma method to shift static PPI network to dynamic PPI network. The main task of lines 4–6 is to compute all the weighted local clustering coefficient, and add the proteins with its clustering coefficient into a set Initial-protein, which will be used in the seed-Generation subfunction to find the seed for the construction of core in line 7. In line 8, the selected seeds and the dynamic subnetworks are used in the other subfunction—core-Construction, to construct a protein group made up of seed proteins as a core of a protein complex. To expand a core, the attachment proteins that satisfy the judgment condition will be added into a core as shown in lines 9–10. Many redundancies are generated from the first 10 lines, so the last thing we need to do is refine the complexes as shown in line 11.

Core-attachment structure.
Algorithm 2 is to generate some seeds to prepare for constructing the core. At the beginning, the seeds from subnetwork
where
Algorithm 3 is designed to construct a core—a group of proteins selected from the seed set. At the beginning, each protein in the seed set will be regarded as a subgraph as shown in lines 2–4. To expand the core, in line 5, the protein, which is generated through the subfunction core-Expand, will be regarded as an alternative protein to expand the core. Whether the alternative protein can be added into the subgraph depends on the jugging condition in lines 6–8, which describes that the density of subgraph must increase after appending a protein. And the density of a graph can be computed by Equation (11) as follows:
where
Figure 7 shows the process of a core's expansion. At first, only protein A is in the subgraph, then the adjacent points of protein A are connected with A, a protein is added into the core whose density is 0.86. In the second step, B, C, and D are chosen as the adjacent proteins of cores A and E, the sum of D's weight is the most maximum, so D is selected to expand the core, which leads the density of the new core to grow up to 0.8733. In the third step, it adopts a similar operation in the first two steps, but the core does not update because the density reduces from 0.8733 to 0.64, which does not satisfy the jugging condition. So the core is composed of A, D, and E.

An example of the core's expansion process.
When the construction of core is complete, the next problem is how to determine the attachment. Equation (12) is considered as an evaluation function to measure the closeness between the attachment and the core, if the value of evaluation function is satisfactory enough, the protein represented by v in the function will be integrated into the core as an attachment. The complete process is described in Algorithm 5.
As shown in line 2, all of the cores are considered as a complex firstly, then line 4 considers the proteins in the PPI network who is the adjacent points of the proteins in the core but not in the core, will be the alternative attachments; if its attach-score is more than the threshold which is value of 0.3 in this article, the alternative attachment will be added into the complex, as performed in lines 4–13. At last, a full complexes set will be gotten. However, there is a lot of redundancy among the complexes, so we proposed another algorithm to eliminate the redundancy. Each protein complex can be regarded as a set, the method of handling sets is applied to compute the similarity between two different complexes as expressed in the Equation (13) (Lakizadeh et al., 2015):
In Algorithm 6, there are three situations that need to be discussed. The first one is if there are two different size complexes with high similarity, the complex with smaller size will be regarded as a benchmark, and the protein in the set difference of the two complexes will be considered only if it can be added into the benchmark by the attachment score function. The second situation is if there are two same size complexes with high similarity in line 15, under this condition, the protein complex with high graph density will be selected, and the complex with lower graph density will be abandoned. The third situation is that if two complexes are not similar to each other, then both of the protein complexes will be chosen to expand the set of complex. The threshold in Algorithm 6 is equal to 0.7.
Some highly active proteins express in most of time points, which leads to the fact that there will be a large part of repetition among the complexes predicted at different time points. So removing the redundant parts from the protein complexes composed of the smaller complexes is necessary. In Algorithm 6, eliminating redundancy algorithm will be applied again with all of the complexes predicted at each time point as the input.
To sum up, our method employs several extraordinary ideas which have not seen in other publications:
1. The PPI networks are shifted from static to dynamic; as the expression value of different proteins varies in different time points and various situations, not all of the protein interactions are active at all times, an improved three-sigma method is introduced to estimate the time when the interaction will be active. The improved three-sigma method with interval estimation shows better results in the sample's mean and variance. 2. Existing algorithms for predicting the protein complexes are inadequate for the PPI networks with weight. A new method that derives from the “core-attachment” structure will decrease the possibility of high overlap among the different protein complexes and improve the Acc of predicting results. 3. In the traditional approach, while calculating the weight of PPI network, only the topological structure information is considered. The biological information is not taken into computation of network's weight. Our approach first applied the GO annotation, implicit biological information, into computing the semantic similarity as the weight between two proteins. Besides, multiple methods integration has been used to calculate semantic similarity from different aspects.
3. Result and Evaluation Metrics
In this section, we first introduced several evaluation metrics. Then, we described the database including the protein interaction database, gene expression data, and the benchmark protein complex data used in our experiments. Last, we discussed the detail of experimental result, presented the effect of the parameters on the experiment. The result of our method proved that integrating expression data and GO semantic similarity with protein interaction data is an effective approach for predicting protein complexes.
3.1. Evaluation metrics
To evaluate our method and compare it with the benchmark, several evaluation metrics such as precision, recall, F-measure, and other respects are given in this section.
First, the neighborhood affinity score that is used to measure the match degree among the predicted complexes and the actual complexes is defined as follows:
where P is a predicted complex, B is the benchmark complex, and Vi is the vertex set of i. Based on the experience from the previous article (Kouhsar et al., 2016), when
The same argument applies to the set Ncb:
in which the benchmark complex must match at least one complex predicted by out method. Based on the definition of neighborhood affinity score, the metrics including precision, recall, and F-measure can be expressed as follows:
The precision value describes how many protein complexes predicted by a method are correct. In contrast, what the recall expresses is that how many benchmark complexes or known complexes have been indexed by the method. Sometimes the precision and the recall have an impact on each other, so F-measure is introduced to balance precision and recall. In addition to that, sensitivity (Sn), positive predictive value (PPV), and Acc are another group of evaluation metrics introduced in this article. If we denote
where Ni is the number of proteins in the benchmark complex. There is another metric called Acc that is a combination of Sn and PPV.
Second, as another evaluation metric, Jaccard score is also introduced in this article to describe the matching degree, the range of Jaccard score is from 0 to 1. When the score is equal to 1, it means the predicted complexes match with the benchmark complexes exactly. Otherwise, if the score is equal to 0, it indicates that there is no intersection between the predicted complex and the benchmark complex.
3.2. Experiment database
In this article, the STRING data (Franceschini et al., 2013) are chosen as the PPI network data. In addition to the biological experimental data, data mining from the context abstract and other databases, there is also some PPI data predicted by bioinformatics method. GSE3431 (Tu and Mcknight, 2005) is introduced as the gene expression data. GO annotation file is downloaded from Saccharomyces Genome Database (Issel-Tarver et al., 2002). For the benchmark, we choose CYC2008 (Pu et al., 2009), which is composed of 408 protein complexes.
3.3. Experiment result analysis
In this part, we try to illustrate how our method is compared with other predictors through several comparative experiments.
3.3.1. The discussion of parameters in our method
From Table 1, first, we can seen that as the step increases, the number of active proteins drops slowly; second, although under the same step, there is also a large difference among the number of proteins at different time points, and the expression ability of a protein is the weakest at time point 12. Third, through the analysis of the table, it can be concluded that not all of the proteins are active all the time, their expression values change with time.
Bold/italic values are the maximum or minimum in each row.
Table 2 is to find the most appropriate parameter d in the improved three-sigma method; it illustrates the impact of different steps on the experimental results under the same threshold conditions. It is evident that the highest value of precision and PPV are 0.536 and 0.347, respectively, when the step parameter is 9; in addition, the comprehensive indicators F-measure and Acc that arrive at the peak point are 0.438 and 0.374, respectively, when the step parameter is 10, which means that both of the mean and variance adopted in the three-sigma method are better than sample mean and variance. From Table 2, it can be concluded that our method can perform better at the condition that the step parameter is equal to 10.
Acc, accuracy; PPV, positive predictive value; Sn, sensitivity.
Bold/italic values are the maximum in each column.
In Algorithm 2—seed-Generate, the cluster-threshold representing the density of a seed set is set to control the seed size, if the lower the cluster-threshold is, the larger the size of seed will be. So the result of predicted protein complexes is closely related to the cluster-threshold value. To determine the optimal cluster-threshold parameter, as shown in Table 3, we compare the experiment results produced under different parameters. As illustrated in Table 3, when cluster-threshold is 0.2, our method performs best on predicting protein complexes. In contrast, it also reflects another drawback of our method that the experimental results depend on the number and value of parameters.
Bold/italic values are the maximum in each column.
If we check the time point wherein all of the three evaluation metrics, precision, recall, and F-measure, reach their maximum point, we have found that they achieve the highest value of 0.382, 0.665, and 0.486, respectively. However, Sn arrives at the highest value with 0.795 when cluster-threshold is 0.8, and PPV and Acc come to the best value with 0.197 and 0.374 when cluster-threshold is 0.7. From Table 3, it can be found that when the cluster-threshold is >0.7, the values of precision and recall decrease. It is related to the PPI data used in the experiment, because the weighted density of most subnetworks constructed in this article is <0.8, resulting in few seeds chosen.
The attachment score threshold is introduced in Algorithm 5 to detect protein complex. To find the appropriate threshold, our method is run on the dynamic PPI network. As shown in Figure 8, the values of Acc remain stable under various thresholds, but the values of F-measure fluctuate slightly. Among the range from 0.15 to 0.7, it keeps stable and averages at 0.43; however, when the threshold is >0.75, the values of F-measure decrease to around 0.4. According to this experiment, attachment score threshold is set to 0.6 as default.

The F-measure and Acc values of our method for various values of attachment score threshold. Acc, accuracy.
3.3.2. Semantic similarity score analysis
To analyze the semantic similarity computed by different methods, frequency statistical graphs in these three aspects including BP, CC, and MF are introduced. As shown in Figures 9–11, even though in the same aspects, the semantic similarity scores vary greatly according to the computational methods. For example, Figure 9 describes that when computing the semantic similarity in BP, the scores of Jiang's method are mainly distributed from 0.1 to 0.2; the range of Lin's method covers from 0 to 0.8, which is much wider than Jiang's and SimIC's; also, the score frequency of SimIC's method decreases gradually, most of the scores distribute between 0 and 0.3. However, most of scores measured by PCA are still between 0 and 0.2, and the range of similarity is also as wide as the Lin's method, so PCA integrates the advantages of the other three measures.

Semantic similarity frequency statistical graph of different methods on biological process. PCA, principal component analysis.

Semantic similarity frequency statistical graph of different methods on cellular component.

Semantic similarity frequency statistical graph of different methods on molecular function.
In addition, Figure 12 shows the effect of different semantic similarity methods on the experimental results. It is clear that the precision, recall, and F-measure of the weighted networks that have been processed by PCA perform better than the other three methods. Furthermore, the icon also confirms that the PCA plays a positive role in removing data noise and improving the Acc of prediction results.

Results compared with different semantic similarity.
3.3.3. Comparison with the known complex
Figure 13 shows the change of Jaccard score from our method at each time point, it reaches a peak of 0.38 with the bottom around 0.27, and from the figure it can also be concluded that the Jaccard scores fluctuate significantly, so the average value is chosen as the evaluating indicator finally to compare with other methods. Besides, the unstable Jaccard values also reveal the dynamics of the PPI network built in this article.

Jaccard scores of our method at different time points.
We compared our method with other classical algorithms, such as the coach (Lakizadeh et al., 2015), WCOACH (Price et al., 2013), MCODE (Bader and Hogue, 2003), MCODE-weight (Price et al., 2013), and Ipca (Min et al., 2008) to assess the predictor carefully. And as shown in Table 4, all of the Jaccard scores are around 0.3, the score predicted by the MCODE method is the lowest with 0.2913. In contrast, our method performs slightly better than WCOACH and MCODE-weight method, 0.31975, 0.30572, and 0.30285, respectively. So the table confirms that our method may perform well in predicting protein complexes.
To be objective, all algorithms are run in the same PPI data and the same benchmark is selected. Figure 14 shows the corresponding precision–recall graphs comparing different traditional cluster algorithms with our method proposed in this article. In most of the recall range, our method still maintains greater precision, in addition, our method outperforms with significantly higher recall, as well as greater precision among the initial top predictions (at recall <0.4).

Precision–recall graph of four cluster algorithms.
It is clearly observed from Figure 15 that our method proposed in this article reaches the highest precision, recall, and F-measure. The Sn value of coach algorithm is higher, but its PPV reduces to 0.14, which leads to a worse Acc value. Both our method and coach algorithm are in light of “core-attachment,” but some details such as density formula, eliminating redundancy, are different. Nevertheless, precision, recall, and F-measure of our method perform better, which illustrates that our method improves the Acc of predicting protein complexes. Moreover, our method introduced semantic similarity dealt with PCA as the weight between PPIs. Although the other two classical algorithms, WCOACH and MCODE-weight, are introduced to predict protein complexes from weighted PPI networks, unfortunately, these two algorithms' effect is not as good as our method, their precision, recall, and F-measure are less than our method. It indicates that semantic similarity dealt with PCA works better than other approaches. In summary, it is assumed that our method effectively improves the prediction Acc of protein complexes.

Results compared with different algorithm.
4. Conclusions
This article proposed an approach to integrate multiple techniques to improve the predicting Acc of the protein complexes. First, we integrate static PPI networks with gene expression data to construct a dynamic network through combining interval estimation with the three-sigma method. The improved method performs better on the data set with low expression values than the original method. Secondly, GO semantic similarity that contains biological information is applied to filter redundant information and process the weight of network. It proves that GO semantic similarity reflects the degree of intimacy between the two proteins better than the topological information. Third, after building a weighted PPI dynamic network, DWCOACH is introduced to predict protein complexes. And, it decreases the overlap among different protein complexes and improves the prediction Acc. Several experiments are conducted and the results prove that our algorithm performs better than other algorithms on selected evaluation metrics. We also observed that there are some drawbacks in our method, for example, the test result is sensitive to the number and value of parameters, which indicates that more experiments should be carried out to find out the best parameter values.
Footnotes
Acknowledgments
This work was supported, in part, by National Science Foundation of China under Grant No. 61402304, the Humanity and Social Science general project of Ministry of Education under Grant No.14YJAZH046, the Beijing Natural Science Foundation under Grant No. 4154065, the Beijing Educational Committee Science and Technology Development Planned under Grant No. KM201610028015, Science and Technology Innovation Platform, Teaching Teacher, and Connotation Development of Colleges and Universities.
Author Disclosure Statement
No competing financial interests exist.
