Identification of overlapping protein complexes by fuzzy K-medoids clustering algorithm in yeast protein-protein interaction networks

Abstract

The identification of overlapping protein complexes in protein-protein interaction (PPI) networks may elucidate cellular functional organization and the underlying cellular mechanisms. Recently, many protein complex mining algorithms have been developed for PPI networks. However, most available algorithms rely on mining dense subgraphs as protein complexes and therefore fail to consider the inherent biological meaning of the relationships between protein pairs. Thus, methods that identify protein complexes using the biological significance hidden in edges need to be investigated. In this paper, we propose IK-medoids, an improved method that detects overlapping protein complexes in weighted PPI networks based on the rough fuzzy relationships between protein pairs. The algorithm first uses the fuzzy relation to obtain non-overlapping protein substructures, and then executes K-medoids on the proteins in the PPI network. Next, the similarity between a protein and each candidate complex is calculated, and the ratio of each similarity to the maximum similarity determines whether the protein belongs to one or multiple complexes. Finally, highly overlapping protein complexes are merged to form the final protein complexes. We apply the method to three PPI networks and validate the results using two reference protein complex sets retrieved from public databases. Experimental results show that our method outperforms classical algorithms, such as ClusterONE, CMC, MCL, OSLOM, and RFC, and achieves good overall performance in terms of F-measure, sensitivity, and accuracy.

Keywords

PPI; protein complex; overlapping; K-medoids; fuzzy relation

1 Introduction

The identification of protein complexes is important in different organisms because it helps clarify biological processes and reveal the inherent organizational structure within cells. With the development of computational methods, many protein complex detection algorithms have been developed for protein-protein interaction (PPI) networks. However, most available algorithms rely on mining dense subgraphs as protein complexes [1, 2] and thus fail to account for the fact that a protein may have multiple functions, so that the corresponding node may belong to more than one cluster [3]. Such cases are challenging for classical protein complex mining algorithms, which assign each node of the PPI network to exactly one protein complex. Recently, many methods have been proposed that identify overlapping protein complexes in unweighted PPI networks [4-7].

Recent studies have concluded that the accuracy of protein complex identification can be significantly improved by using edge weights to eliminate false positives in PPI networks [7]. Several algorithms have been executed on weighted PPI networks to identify protein complexes [7-15]. Notably, these methods detect protein complexes using topological features of the network; their major shortcoming is that performance deteriorates rapidly on sparse PPI networks [16, 17]. Motivated by this issue, a number of researchers have demonstrated that the accuracy of protein complex detection can be improved by exploiting additional protein data, such as gene expression information [18-20] and multiple heterogeneous data sources [20, 21]. Despite their advantages, these approaches share the following limitations:

The quality of protein complexes is limited by incomplete gene expression data.

They are unable to identify accurate overlapping protein complexes.

The identification of real protein complexes with few interactions has not been significantly improved, which limits further gains in detection accuracy.

To address these limitations, we introduced a novel approach that detects overlapping protein complexes in PPI networks using a rough fuzzy clustering algorithm (RFC) [12]. RFC detects overlapping protein complexes based on the fuzzy relation model, which first yields equivalence classes (non-overlapping protein complexes). Subsequently, the upper and lower approximations of rough sets are used to decide whether a protein belongs to one or multiple protein complexes, based on the ratio of each similarity to the maximum similarity.

Unlike classical clustering algorithms [12, 22], the improved K-medoids proposed here, called IK-medoids, is based on rough fuzzy theory and can effectively detect overlapping protein complexes in PPI networks. The algorithm consists of three major steps. First, we transform the adjacency matrix of the PPI network into a fuzzy matrix according to the GO similarity values between all protein pairs in the network. The fuzzy matrix is then transformed into a fuzzy equivalence relation by transitive closure, and the λ-cut matrix is used to obtain the equivalence classes, i.e., the non-overlapping substructures, which are sorted in descending order of their similarity values. Next, starting from the non-overlapping substructure with the highest similarity value, the algorithm iteratively updates the clustering center of each substructure until the protein clusters no longer change; the resulting clusters are called candidate protein complexes. Thereafter, the similarity ratio between each protein and each candidate complex is computed and compared with a chosen threshold to decide whether candidate complexes overlap. Finally, the extent of overlap between each pair of candidate protein complexes is quantified, and pairs whose overlap score is above a specified threshold are merged.

To validate the ability of IK-medoids, we applied it to three PPI networks and compared it with five state-of-the-art algorithms by using two reference protein complex datasets retrieved from public databases. Experimental results showed that our method outperformed other algorithms in most datasets in terms of F-measure, sensitivity, and accuracy.

2 Methods

A PPI network is modeled as an undirected weighted graph G = (V, E, w), where V is the set of vertices (proteins), E is the set of edges (interacting protein pairs), and w is the set of similarity values of the protein pairs.

2.1 Problem definition

Prior to presenting a detailed description of our algorithm, we first introduce the basic definitions of fuzzy rough set theory, which are used throughout the following sections.

Definition 1. (Fuzzy relation) Let $X, Y \subseteq \mathbb{R}$ be universal sets; then R = {((x, y), σ_R (x, y)) | (x, y) ∈ X × Y} is called a fuzzy relation on X × Y [23]. Here, σ_R (x, y) ∈ [0, 1] is the membership function.

Definition 2. (Fuzzy matrix) If the values of all entries of a matrix are within the closed interval [0, 1], then we call that matrix a fuzzy matrix [23].

Definition 3. (Fuzzy equivalence relation) A fuzzy relation R_f that is reflexive, symmetric, and transitive is called a fuzzy equivalence relation; that is, R_f satisfies the following conditions [23]:

Reflexive, if R_f (x, x) = 1;

Symmetric, if R_f (x, y) = R_f (y, x);

Transitive, if R_f ∘ R_f ⊆ R_f.

Definition 4. (λ-cut matrix) Let R_f = (r_ij) _n×m. For any λ ∈ [0, 1], let R_{f_λ} = (r_ij (λ)) _n×m, where $r_{ij} (λ) = \begin{cases} 1, & r_{ij} \geq λ \\ 0, & r_{ij} < λ \end{cases}$ . Then R_{f_λ} is called the λ-cut matrix of R_f [23].

Definition 5. (Equivalence class) Let R_f be a fuzzy equivalence relation and X be a nonempty set. For each x_i ∈ X, the equivalence class of object x_i under R_f is defined as follows: [x_i] _R = { x_j ∈ X | (x_i, x_j) ∈ R_f } [23].

Definition 6. (Substructure) Let R_{f_λ} = (r_ij (λ)) _n×m. Each sub-matrix of R_{f_λ} whose entries all satisfy r_ij (λ) = 1 corresponds to a protein set in the PPI network, which we take as a substructure.

Definition 7. (Upper and lower approximations) For a nonempty set X ⊆ V and an equivalence relation R_f, the upper and lower approximations of X with respect to R_f are defined as follows [12, 24]: ${\bar{R}}_{f} (X) = \{x \mid x \in V, {[x]}_{R_{f}} \cap X \neq \emptyset\} .$ (1) ${\underline{R}}_{f} (X) = \{x \mid x \in V, {[x]}_{R_{f}} \subseteq X\} .$ (2)

Here, ${\bar{R}}_{f} (X)$ is the upper approximation of X for the equivalence relation R_f, and ${\underline{R}}_{f} (X)$ is the lower approximation of X for R_f. The difference ${\bar{R}}_{f} (X) - {\underline{R}}_{f} (X)$ is the boundary region of X for R_f.
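As an illustrative sketch of Definition 7 (not the authors' implementation), the following Python function computes the lower and upper approximations of a set from a partition into equivalence classes. The function name `rough_approximations` and the toy partition are assumptions made for this example.

```python
def rough_approximations(classes, target):
    """Return (lower, upper) approximations of `target` for a partition.

    classes: iterable of disjoint sets (the equivalence classes [x]_{R_f})
    target:  the set X to approximate
    """
    target = set(target)
    lower, upper = set(), set()
    for cls in classes:
        cls = set(cls)
        if cls <= target:   # [x] is contained in X: part of the lower approximation
            lower |= cls
        if cls & target:    # [x] intersects X: part of the upper approximation
            upper |= cls
    return lower, upper


# Hypothetical partition of six proteins into three equivalence classes
partition = [{"A", "B", "C", "D"}, {"E"}, {"F"}]
lower, upper = rough_approximations(partition, {"A", "B", "E"})
boundary = upper - lower  # boundary region of X
```

In this toy case only the class {E} lies entirely inside X, so it forms the lower approximation, while the class {A, B, C, D} merely intersects X and therefore ends up in the boundary region.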

2.2 Generating substructures using fuzzy relation

After introducing the basis for fuzzy rough theory, we provide the description of substructure generation (Algorithm 1) [22].

Algorithm 1 Generating substructures by fuzzy relation
Input: PPI network G = (V, E, w);
Output: Substructure set S;
1: Transform G to R, R = (r_ij) _n×n; // R: Fuzzy matrix
2: Transform R to t (R) by transitive closure [23], $t (R) = {(r_{ij}^{*})}_{n \times n}$ ; // t (R): Fuzzy equivalence relation
3: Choose a threshold λ ∈ [0.8, 1) and transform t (R) to t (R) _λ according to Definition 4, $t {(R)}_{λ} = {(r_{ij}^{*} (λ))}_{n \times n}$ ; // t (R) _λ: Boolean equivalence relation
4: According to Definition 5, divide t (R) _λ into several equivalence classes, [V] _R, each of which corresponds to a protein set in the PPI network, i.e., a substructure;
5: Save the substructure set and sort the substructures in descending order of GO similarity value;
6: Output the substructure set S

In Algorithm 1, we transform the PPI network into the fuzzy matrix by replacing each entry of 1 with the corresponding similarity value from w; the fuzzy matrix R = (r_ij) _n×n is then transformed into the fuzzy equivalence relation $t (R) = {(r_{ij}^{*})}_{n \times n}$ using transitive closure [23] (lines 1-2). In line 3, following Definition 4, we choose a threshold λ ∈ [0.8, 1): if $r_{ij}^{*} \geq λ$ , then $r_{ij}^{*} (λ) = 1$ ; otherwise $r_{ij}^{*} (λ) = 0$ . The choice of λ is closely related to the similarity values between proteins in the PPI network. The similarity of a set S_i is $Sim (S_{i}) = \sum_{p, p^{'} \in S_{i}} Sim (p, p^{'})$ (3)

Here, Sim (p, p′), calculated by the method proposed in [25], represents the GO similarity value between proteins p and p′. By means of the λ-cut matrix, we transform the fuzzy equivalence relation t (R) into a Boolean equivalence relation t (R) _λ. Different values of λ correspond to different equivalence classes, which are used to divide the PPI network and thereby obtain non-overlapping modules, that is, substructures (line 4). In line 5, we save the substructures and sort them in descending order of their GO similarity values. The obtained substructure set is treated as a set of candidates that are subsequently expanded. Figure 1 illustrates the process of substructure generation.

Fig.1

PPI network example illustrating the process of substructure generation.

Step 1. Transform G to R using weight between each protein pair, as calculated by [26].

$\begin{matrix} G = (\begin{matrix} 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{matrix}) \Rightarrow \\ R = (\begin{matrix} 1.00 & 1.00 & 1.00 & 0.67 & 0.00 & 0.00 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.00 & 0.00 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.00 & 0.00 \\ 0.67 & 0.67 & 0.67 & 1.00 & 0.50 & 0.50 \\ 0.00 & 0.00 & 0.00 & 0.50 & 1.00 & 0.50 \\ 0.00 & 0.00 & 0.00 & 0.50 & 0.50 & 1.00 \end{matrix}) \end{matrix}$ (4)

Step 2. Transform the fuzzy matrix R into the fuzzy equivalence relation t (R) by transitive closure [23].

$\begin{matrix} R & = & (\begin{matrix} 1.00 & 1.00 & 1.00 & 0.67 & 0.00 & 0.00 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.00 & 0.00 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.00 & 0.00 \\ 0.67 & 0.67 & 0.67 & 1.00 & 0.50 & 0.50 \\ 0.00 & 0.00 & 0.00 & 0.50 & 1.00 & 0.50 \\ 0.00 & 0.00 & 0.00 & 0.50 & 0.50 & 1.00 \end{matrix}) \Rightarrow \\ t (R) & = & (\begin{matrix} 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 0.67 & 0.67 & 0.67 & 1.00 & 0.50 & 0.50 \\ 0.50 & 0.50 & 0.50 & 0.50 & 1.00 & 0.50 \\ 0.50 & 0.50 & 0.50 & 0.50 & 0.50 & 1.00 \end{matrix}) \end{matrix}$ (5)

Step 3. Choose a threshold λ ∈ [0.5, 1], and transform t (R) to t (R) _λ. For λ = 0.5:

$\begin{matrix} t (R) = (\begin{matrix} 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 0.67 & 0.67 & 0.67 & 1.00 & 0.50 & 0.50 \\ 0.50 & 0.50 & 0.50 & 0.50 & 1.00 & 0.50 \\ 0.50 & 0.50 & 0.50 & 0.50 & 0.50 & 1.00 \end{matrix}) \Rightarrow \\ t (R)_{λ = 0.50} = (\begin{matrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{matrix}) \end{matrix}$ (6)

Notably, t (R) _λ=0.5 represents the lower approximation of R for equivalence relation t (R), t (R) _λ=1 describes the upper approximation of R for equivalence relation t (R). Obviously, t (R) _λ=1 - t (R) _λ=0.5 represents the boundary region of R for equivalence relation t (R), which is explained in Step 4.

Step 4. When λ = 0.5, t (R) _λ is divided into one equivalence class, [A] _R ={ A, B, C, D, E, F }, which corresponds to the protein set in the network. The set of {A, B, C, D, E, F} is regarded as a substructure.

$\begin{matrix} t (R) = (\begin{matrix} 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 1.00 & 1.00 & 1.00 & 0.67 & 0.50 & 0.50 \\ 0.67 & 0.67 & 0.67 & 1.00 & 0.50 & 0.50 \\ 0.50 & 0.50 & 0.50 & 0.50 & 1.00 & 0.50 \\ 0.50 & 0.50 & 0.50 & 0.50 & 0.50 & 1.00 \end{matrix}) \Rightarrow \\ t (R)_{λ = 0.67} = (\begin{matrix} 1 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}) \end{matrix}$ (7)

Similarly, when λ = 0.67, t (R) _λ is divided into three equivalence classes, [A] _R ={ A, B, C, D }, [E] _R ={ E }, and [F] _R ={ F }, which correspond to different protein sets in the network. The substructures are {A, B, C, D}, {E}, and {F}, respectively.

From the above analysis, we can observe that different values of λ yield different non-overlapping substructures.
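The worked example can be reproduced with a short Python sketch. It assumes the max-min composition as the transitive closure operator and plain nested lists as the matrix representation; the function names are illustrative and are not taken from the paper.

```python
def maxmin_compose(a, b):
    """Max-min composition of two square fuzzy matrices (lists of lists)."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Square R under max-min composition until it stops changing.

    Assumes r is reflexive and symmetric, so the iteration converges to t(R).
    """
    while True:
        nxt = maxmin_compose(r, r)
        if nxt == r:
            return r
        r = nxt


def lambda_cut_classes(t, lam, labels):
    """Equivalence classes of the lambda-cut of a fuzzy equivalence matrix."""
    classes, seen = [], set()
    for i, name in enumerate(labels):
        if name in seen:
            continue
        cls = {labels[j] for j in range(len(labels)) if t[i][j] >= lam}
        classes.append(cls)
        seen |= cls
    return classes


# Fuzzy matrix R of the six-protein example (proteins A..F)
R = [
    [1.00, 1.00, 1.00, 0.67, 0.00, 0.00],
    [1.00, 1.00, 1.00, 0.67, 0.00, 0.00],
    [1.00, 1.00, 1.00, 0.67, 0.00, 0.00],
    [0.67, 0.67, 0.67, 1.00, 0.50, 0.50],
    [0.00, 0.00, 0.00, 0.50, 1.00, 0.50],
    [0.00, 0.00, 0.00, 0.50, 0.50, 1.00],
]
labels = list("ABCDEF")
T = transitive_closure(R)
classes_050 = lambda_cut_classes(T, 0.50, labels)
classes_067 = lambda_cut_classes(T, 0.67, labels)
```

With these inputs, the closure fills the zero entries between {A, B, C} and {E, F} with 0.50, the cut at λ = 0.50 yields a single class, and the cut at λ = 0.67 yields {A, B, C, D}, {E}, and {F}.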

2.3 Extending substructure by IK-medoids

Let Q = {Q₁, Q₂, . . . , Q_n} denote the substructure set obtained by Algorithm 1 and O = {O₁, O₂, . . . , O_n} denote the set of clustering centers. The similarity value of each edge is used to measure the distance between proteins. The center O_i of substructure Q_i is selected as follows: $O_{i} = \arg\max_{p \in Q_{i}} Sim (p, Q_{i})$ (8) $Sim (p, Q_{i}) = \sum_{p^{'} \in Q_{i}, p^{'} \neq p} Sim (p, p^{'})$ (9) where Sim (p, Q_i) denotes the total similarity value between protein p and its direct neighbors in the substructure.

After the clustering center of each substructure is generated, each remaining protein outside the substructures is assigned to the clustering center to which it has the highest similarity value. The resulting partition is denoted q, and the initial protein clusters are denoted C₁, C₂, . . . , C_K. To improve the result of K-medoids, we introduce a new fitness function [27], defined as follows: $F (q) = \frac{\sqrt{E (q)}}{\sqrt{O (q)}}$ (10) where $E (q) = \sum_{i = 1}^{K} \sum_{p \in C_{i}} S^{2} (p, o_{i})$ is the sum of squared similarity values between each protein and its clustering center, p denotes a protein in cluster C_i, and o_i is the clustering center of C_i. $O (q) = \sum_{i, j = 1}^{K} S^{2} (o_{i}, o_{j})$ is the sum of squared similarity values between different clustering centers o_i and o_j.
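To make Equations (8)-(10) concrete, the following Python sketch selects a substructure center and evaluates the fitness of a partition. It is a minimal illustration under stated assumptions: edge weights are stored in a dictionary keyed by `frozenset` pairs, missing edges have similarity 0, only pairs of distinct centers contribute to O(q), and the fitness is reported as infinity when O(q) is 0. None of these choices is specified in the paper.

```python
import math


def sim(w, p, q):
    """Edge similarity; 0.0 when the pair has no weighted edge (assumption)."""
    return w.get(frozenset((p, q)), 0.0)


def select_center(w, members):
    """Eqs. (8)-(9): the member with the largest total similarity to the rest."""
    ordered = sorted(members)  # fixed order gives deterministic tie-breaking

    def total(p):
        return sum(sim(w, p, q) for q in ordered if q != p)

    return max(ordered, key=total)


def fitness(w, clusters, centers):
    """Eq. (10): sqrt(E(q)) / sqrt(O(q)) for clusters and their centers."""
    e = sum(sim(w, p, o) ** 2
            for cluster, o in zip(clusters, centers) for p in cluster)
    # Assumption: only pairs of distinct centers contribute to O(q).
    o = sum(sim(w, a, b) ** 2 for a in centers for b in centers if a != b)
    return math.sqrt(e) / math.sqrt(o) if o > 0 else float("inf")


# Toy weights taken from the six-protein example in Section 2.2
edges = {
    ("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0,
    ("A", "D"): 0.67, ("B", "D"): 0.67, ("C", "D"): 0.67,
    ("D", "E"): 0.5, ("D", "F"): 0.5, ("E", "F"): 0.5,
}
w = {frozenset(k): v for k, v in edges.items()}
center_abcd = select_center(w, {"A", "B", "C", "D"})
f_value = fitness(w, [{"A", "B", "C", "D"}, {"E", "F"}], ["D", "E"])
```

In the toy network, A, B, and C tie for the largest total similarity inside {A, B, C, D}, and the sorted order breaks the tie in favor of A. A real implementation would need an explicit tie-breaking rule, which the paper does not state.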

Different from [27], the clustering center is updated within a substructure: the non-representative protein with the highest similarity value to the current clustering center, denoted O_random, is used to replace the representative protein O_i. The clusters are then re-divided, the new partition is denoted q′, and F (q′) is computed. Next, we calculate ΔF = F (q′) - F (q). If ΔF > 0, then O_i is replaced with O_random, and new clusters are formed. The procedure is executed iteratively until the protein clusters no longer change. Algorithm 2 describes the IK-medoids algorithm, which consists of two main parts. First, IK-medoids finds non-overlapping protein complexes with high density from the generated substructures (lines 1-6). Second, similar to Ref. [12], IK-medoids determines in two sub-steps whether proteins belong to one or multiple protein complexes. In the first sub-step, the similarity value between each protein and each candidate complex in the PPI network is calculated as follows: $Sim (p, C_{j}) = \sum_{p^{'} \in C_{j}, p^{'} \neq p} Sim (p, p^{'})$ (11)

Similar to Equation (9), Sim (p, C_j) denotes the total similarity value between protein p and its direct neighbors in candidate complex C_j, including proteins p that lie outside the complex (line 7).

In the second sub-step, the maximum similarity value S_max (p, C_i) between each protein and the candidate complexes is found, and then we compute $ratio = \frac{S (p, C_{j})}{S_{max} (p, C_{i})}, j \neq i$ (lines 8 and 9). If ratio < β, then C_j and C_i are non-overlapping complexes; if ratio ≥ β, then C_j and C_i are overlapping complexes (lines 10-12). Here, the value of β ranges from 0.8 to 1 with a 0.02 increment.

Finally, IK-medoids merges pairs of complexes whose overlap score OS is larger than or equal to a pre-defined threshold [7] and removes complexes whose density is less than 0.01. In our study, the threshold is set to 0.64, the same as in previous studies [12]. The merging process terminates when the overlap score of every pair of protein complexes falls below 0.64. The post-processing is presented in detail in Algorithm 3.

Algorithm 2 The description of IK-medoids
Input: Substructure set Q;
Output: Protein complexes list C;
// Step 1: Generate non-overlapping substructures
1: Compute the clustering center in each substructure by Equation (8);
2: Assign each remaining protein outside the substructures to the nearest representative object O_i, and compute F (q) by Equation (10);
3: Select the non-representative protein object with the highest similarity value in a substructure, O_random, replace the representative object O_i with it, re-divide the clusters, marked as q′, and compute F (q′) by Equation (10);
4: Compute ΔF = F (q′) - F (q); if ΔF > 0, then replace representative object O_i with O_random to form n new protein clusters;
5: Repeat step 3 and step 4 until no change in the protein clusters;
6: Save all protein clusters as candidate complexes in C;
// Step 2: Judge whether protein complexes are overlapping
7: Compute similarity value between each protein and candidate complexes S (p, C_j) by Equation (11);
8: Find the maximum value between each protein and candidate complexes S_max (p, C_i);
9: Compute $ratio = \frac{S (p, C_{j})}{S_{max} (p, C_{i})}, j \neq i$ ;
⊲ Choose a protein in the pseudo-clique cluster to remove
10: if ratio < β, then C_j, C_i are non-overlapping complexes;
11: else C_j, C_i are overlapping complexes;
12: end if
13: return C
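Step 2 of Algorithm 2 can be sketched in Python as follows. This is one possible reading of the ratio test, in which a protein is added to every candidate complex whose similarity is at least β times its best similarity; the function names, the dictionary-based weight store, and the toy data are assumptions for illustration only.

```python
def sim_to_complex(w, p, members):
    """Eq. (11): total similarity between p and the other members of a complex."""
    return sum(w.get(frozenset((p, q)), 0.0) for q in members if q != p)


def assign_overlaps(w, proteins, complexes, beta):
    """Add p to each complex with S(p, C_j) / S_max(p, C_i) >= beta."""
    result = [set(c) for c in complexes]
    for p in proteins:
        scores = [sim_to_complex(w, p, c) for c in complexes]
        best = max(scores)
        if best <= 0:
            continue  # p has no weighted link to any candidate complex
        for j, s in enumerate(scores):
            if s / best >= beta:
                result[j].add(p)
    return result


# Toy data: protein D is strongly linked to {A, B, C} and weakly to {E, F}
edges = {
    ("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0,
    ("A", "D"): 0.67, ("B", "D"): 0.67, ("C", "D"): 0.67,
    ("D", "E"): 0.5, ("D", "F"): 0.5, ("E", "F"): 0.5,
}
w = {frozenset(k): v for k, v in edges.items()}
candidates = [{"A", "B", "C"}, {"E", "F"}]
strict = assign_overlaps(w, ["D"], candidates, beta=0.8)
loose = assign_overlaps(w, ["D"], candidates, beta=0.4)
```

For protein D the ratio between its similarity to {E, F} and its best similarity is roughly 0.5, so with β = 0.8 it joins only the first complex, while a hypothetical β = 0.4 (outside the range used in the paper) would make the two complexes overlap at D.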

Algorithm 3 Merge candidate protein complexes
Input: Protein complexes list C;
The threshold for overlap δ;
Output: Protein complexes list L;
1: L ← ∅;
2: for all C_x ∈ C do
3: for all $C_{x}^{'} \in C$ do // $C_{x}^{'} \neq C_{x}$
4: if $OS (C_{x}, C_{x}^{'}) \geq δ$ then
5: $L \leftarrow L \cup {C_{x}} \cup {C_{x}^{'}}; C \leftarrow C ∖ {C_{x}^{'}}$
6: endif
7: endfor
8: C ← C ∖ {C_x}
9: endfor
10: L ← L ∪ C
11: return L
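The merging step can be sketched in Python as follows. The sketch assumes that merging two complexes means taking their union and that the overlap score of Equation (12) in Section 3.2 is used; it repeats until no pair reaches the threshold δ, which is one reading of Algorithm 3 rather than a line-by-line transcription.

```python
def overlap_score(a, b):
    """Eq. (12): |a ∩ b|^2 / (|a| * |b|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))


def merge_complexes(complexes, delta=0.64):
    """Union pairs with overlap score >= delta until no such pair remains."""
    merged = [set(c) for c in complexes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_score(merged[i], merged[j]) >= delta:
                    merged[i] |= merged[j]  # assumption: merge = set union
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


merged_example = merge_complexes([{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {8, 9}])
unmerged_example = merge_complexes([{1, 2, 3, 4}, {1, 2, 3, 5}])
```

Restarting the scan after each merge keeps the sketch simple at the cost of extra comparisons; the first example merges because its score is 16/20 = 0.8, whereas the second stays separate because its score is 9/16, below 0.64.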

3 Experiments and results

3.1 Datasets

To evaluate the performance of our method, we also run five clustering algorithms that can detect protein complexes in weighted PPI networks, i.e., ClusterONE [7], CMC [9], MCL [28], OSLOM [29], and RFC [12], on the same test data. The parameters required by each method are set as suggested by its authors. Three weighted PPI networks are used to test the performance of our method, i.e., Krogan [30], Gavin [31], and Collins [32]. Detailed information on the three datasets is shown in Table 1.

Table 1
Summary of the Three Datasets Used in Our Study

DataSet	Proteins	Interactions	Density
Krogan	2708	7123	0.002
Gavin	1855	7669	0.004
Collins	1622	9074	0.007

Two gold standard datasets, MIPS [12, 33] and SGD [12, 34], are used to evaluate the quality of the protein complexes identified in the yeast PPI networks. A summary of the two gold standard datasets is presented in Table 2. The PPI datasets serve as the data sources, whereas the gold standards, which usually provide the most reliable evidence for physical interactions, are used to validate the proposed method by measuring the match rate between the detected protein complexes and the reference complexes.

Table 2

Summary of the Gold Standard Datasets in Our Study

Gold standard	Complexes	Proteins	Overlapping proteins
MIPS	203	1189	401
SGD	323	1279	332

3.2 Evaluation metrics

To assess the quality of the detected protein complexes, we match the generated protein complexes with the gold standard datasets. The definitions of these evaluation measures are introduced as follows:

Overlap score. Let p be an identified complex and g be a complex in the gold standard database. The overlap score OS (p, g) is defined as: $OS (p, g) = \frac{| N_{p} \cap N_{g} |^{2}}{| N_{p} | \cdot | N_{g} |}$ (12)

Here, |N_p| is the size of the predicted complex, |N_g| is the size of the gold standard complex, and |N_p ∩ N_g| is the number of proteins shared by the predicted and gold standard complexes. If OS (p, g) ≥ θ, we consider p and g to match each other. In our experiment, we set θ = 0.2, as in previous studies [4, 36].

Precision, recall, and F-measure. Let P and G denote the detected protein complex set and the gold standard complex set, respectively. The measures are defined as follows [4, 37]: $N_{cp} = | \{p \mid p \in P, \exists g \in G, OS (p, g) \geq θ\} |$ (13) $N_{cg} = | \{g \mid g \in G, \exists p \in P, OS (p, g) \geq θ\} |$ (14) $\begin{matrix} Precision = \frac{N_{cp}}{| P |}, Recall = \frac{N_{cg}}{| G |}, \\ F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{matrix}$ (15)

In Formulas (13) and (14), N_cp is the number of detected protein complexes that match at least one gold standard complex with OS (p, g) ≥ θ, and N_cg is the number of gold standard complexes that match at least one detected complex with OS (p, g) ≥ θ. The F-measure is defined as the harmonic mean of Precision and Recall and evaluates the overall performance of a protein complex detection method.
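The matching-based measures can be computed with a few lines of Python. The sketch below is illustrative: complexes are represented as Python sets, the function names are assumptions, and the toy complexes are made up for the example.

```python
def overlap_score(p, g):
    """Eq. (12): |p ∩ g|^2 / (|p| * |g|); 0.0 if either set is empty."""
    if not p or not g:
        return 0.0
    return len(p & g) ** 2 / (len(p) * len(g))


def precision_recall_f(predicted, gold, theta=0.2):
    """Precision, recall, and F-measure from OS-based matching at threshold theta."""
    n_cp = sum(1 for p in predicted
               if any(overlap_score(p, g) >= theta for g in gold))
    n_cg = sum(1 for g in gold
               if any(overlap_score(p, g) >= theta for p in predicted))
    precision = n_cp / len(predicted) if predicted else 0.0
    recall = n_cg / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Hypothetical predicted and gold standard complexes
predicted = [{1, 2, 3}, {4, 5}, {9, 10}]
gold = [{1, 2, 3, 4}, {4, 5, 6}, {7, 8}]
precision, recall, f_measure = precision_recall_f(predicted, gold)
```

Here two of the three predicted complexes match a gold standard complex and two of the three gold standard complexes are matched, so precision, recall, and F-measure all equal 2/3.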

Sensitivity, PPV, and accuracy. In our study, we also use accuracy to evaluate the identified protein complexes. It is defined as the geometric mean of sensitivity (S_n) and positive predictive value (PPV): $Acc = \sqrt{S_{n} \times PPV}$ (16) where $S_{n} = \frac{\sum_{i = 1}^{n} \max_{j = 1}^{m} T_{ij}}{\sum_{i = 1}^{n} num_{i}}$ , $PPV = \frac{\sum_{j = 1}^{m} \max_{i = 1}^{n} T_{ij}}{\sum_{j = 1}^{m} \sum_{i = 1}^{n} T_{ij}}$ , num_i is the number of proteins in the i-th benchmark complex, and T_ij denotes the size of the intersection between the i-th benchmark complex and the j-th detected complex. S_n measures how well the predicted complexes cover the proteins in the benchmark complexes; it depends on the size of the predicted complexes, so large predicted complexes tend to have high S_n values. PPV indicates how likely the proteins in the detected complexes are to be true positives.
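As a minimal sketch of Equation (16), the following Python function builds the intersection table T and derives S_n, PPV, and Acc from it. It assumes non-empty lists of benchmark and detected complexes represented as sets; the function name and toy data are illustrative.

```python
import math


def sn_ppv_acc(benchmark, detected):
    """Return (Sn, PPV, Acc) for lists of benchmark and detected complexes."""
    # t[i][j] = size of the intersection of benchmark i and detected j
    t = [[len(b & d) for d in detected] for b in benchmark]
    sn = sum(max(row) for row in t) / sum(len(b) for b in benchmark)
    col_max = [max(t[i][j] for i in range(len(benchmark)))
               for j in range(len(detected))]
    total = sum(sum(row) for row in t)
    ppv = sum(col_max) / total if total else 0.0
    return sn, ppv, math.sqrt(sn * ppv)


# Hypothetical benchmark and detected complexes
benchmark = [{1, 2, 3, 4}, {5, 6}]
detected = [{1, 2, 3}, {4, 5, 6}]
sn, ppv, acc = sn_ppv_acc(benchmark, detected)
```

For this toy input T = [[3, 1], [0, 2]], so S_n = 5/6 and PPV = 5/6, and the accuracy is their geometric mean, 5/6.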

3.3 Effect of Lambda (λ)

In Algorithm 1, the parameter λ is used to obtain the non-overlapping substructures. To investigate the effect of λ on the performance of IK-medoids, we study how the F-measure behaves for λ ∈ [0.8, 1) with a 0.02 increment. The detailed experimental results for different λ values are presented in Fig. 2. As shown in Fig. 2(a), on the Krogan dataset the F-measure decreases gradually as λ increases and stays below the value achieved at λ = 0.8, where it reaches its maximum values of 0.476 and 0.475 when the MIPS and SGD gold standards are used, respectively. Similarly, for the Gavin dataset, we set λ = 0.8, as shown in Fig. 2(b). On the Collins dataset, the F-measure increases gradually with λ, reaching 0.623 and 0.591 at λ = 0.84 when the MIPS and SGD gold standards are used, respectively, as shown in Fig. 2(c); λ = 0.84 is therefore taken as the optimal value.

Fig.2

Values of F-measure for different values of λ ∈ [0.8, 1) with a 0.02 increment in three datasets using MIPS, SGD, respectively. (a) Krogan (β = 0.94); (b) Gavin (β = 0.84); and (c) Collins (β = 0.8).

3.4 Impact of Beta (β)

In Algorithm 2, we employ the parameter β to determine the boundaries of the upper and lower approximations, which are used to judge whether a protein belongs to one or multiple protein complexes. In this work, β ∈ [0.8, 1). Figure 3 shows how the F-measure of our method varies with β ∈ [0.8, 1) when MIPS and SGD are used as the gold standard datasets. Figures 3(a) to (c) present the results for the three datasets. In Fig. 3(a), for the Krogan dataset with λ = 0.8, the F-measure increases gradually with β until β = 0.94, where the maximum values of 0.476 and 0.475 are reached for the MIPS and SGD gold standards, respectively. For the Krogan dataset, we therefore set β = 0.94. For the Gavin dataset with λ = 0.8, as shown in Fig. 3(b), the F-measure changes only slightly as β gradually increases, until it reaches the minimum values of 0.348 and 0.242 for the MIPS and SGD gold standards, respectively. For the Gavin dataset, we set β = 0.8. Meanwhile, on the Collins dataset with λ = 0.84, the F-measure increases with β and reaches 0.340 and 0.327 for the MIPS and SGD gold standards, as shown in Fig. 3(c); the optimal value is β = 0.82. In this study, the parameters λ and β are searched in [0.8, 1). However, when β is greater than 0.94, IK-medoids does not perform well on the Gavin dataset. For consistency, the values of λ and β are therefore restricted to [0.8, 0.94].

Fig.3

Values of F-measure for different values of β with a 0.02 increment in three datasets using MIPS, SGD, respectively. (a) Krogan (λ = 0.8); (b) Gavin (λ = 0.8); and (c) Collins (λ = 0.84).

3.5 Comparison with other methods

We compare IK-medoids with five state-of-the-art methods: ClusterONE [7], CMC [9], MCL [38], OSLOM [39], and RFC [12]. The results are listed in Tables 3 and 4, and the highest value of each dataset is presented in bold. We present the results of the three datasets by using the MIPS gold standard dataset in detail and the SGD gold standard dataset briefly.

Table 3
Performance Comparison on the Three Data Sets with Other Methods Using MIPS

Dataset	Method	Number	Precision	F-measure	Sensitivity	Accuracy
Krogan	ClusterONE	522	0.228	0.328	0.357	0.358
	CMC	142	0.549	0.452	0.234	0.285
	MCL	366	0.232	0.298	0.361	0.361
	OSLOM	58	0.138	0.061	0.381	0.301
	RFC	122	0.361	0.271	0.483	0.295
	IK-medoids	191	0.560	0.476	0.707	0.202
Gavin	ClusterONE	196	0.536	0.526	0.358	0.740
	CMC	341	0.416	0.522	0.254	0.311
	MCL	252	0.353	0.391	0.316	0.355
	OSLOM	88	0.625	0.378	0.402	0.357
	RFC	153	0.575	0.494	0.409	0.375
	IK-medoids	654	0.706	0.493	0.652	0.252
Collins	ClusterONE	195	0.625	0.613	0.430	0.415
	CMC	327	0.434	0.535	0.374	0.402
	MCL	180	0.628	0.590	0.402	0.402
	OSLOM	99	0.909	0.596	0.408	0.387
	RFC	108	1.000	0.695	0.433	0.386
	IK-medoids	357	0.297	0.340	0.610	0.217

Table 4

Performance Comparison on the Three Data Sets with Other Methods Using SGD

Dataset	Method	Number	Precision	F-measure	Sensitivity	Accuracy
Krogan	ClusterONE	522	0.377	0.466	0.523	0.550
	CMC	142	0.648	0.395	0.326	0.415
	MCL	366	0.338	0.360	0.523	0.523
	OSLOM	58	0.396	0.121	0.542	0.420
	RFC	122	0.546	0.297	0.638	0.420
	IK-medoids	191	0.728	0.475	0.765	0.232
Gavin	ClusterONE	196	0.642	0.485	0.412	0.513
	CMC	341	0.443	0.454	0.332	0.414
	MCL	252	0.488	0.428	0.431	0.518
	OSLOM	88	0.648	0.277	0.514	0.466
	RFC	153	0.660	0.424	0.517	0.502
	IK-medoids	654	0.926	0.488	0.630	0.310
Collins	ClusterONE	195	0.713	0.536	0.525	0.550
	CMC	327	0.507	0.510	0.470	0.512
	MCL	180	0.783	0.560	0.502	0.549
	OSLOM	99	0.969	0.455	0.527	0.510
	RFC	108	0.972	0.487	0.525	0.495
	IK-medoids	357	0.305	0.327	0.670	0.188

In Table 3, the highest value for each dataset is in bold. The parameters of IK-medoids are set as follows: λ = 0.8 and β = 0.94 for the Krogan dataset, λ = 0.8 and β = 0.8 for the Gavin dataset, and λ = 0.84 and β = 0.82 for the Collins dataset. The other methods use the optimal parameters recommended by their authors.

Table 3 shows the comparison of IK-medoids with the other methods using the MIPS gold standard dataset. First, we compare IK-medoids with ClusterONE, CMC, MCL, OSLOM, and RFC on the Krogan dataset. As shown in Table 3, IK-medoids achieves the maximum values of all evaluated metrics on this dataset except for the number of detected protein complexes and the accuracy. ClusterONE predicts the largest number of protein complexes, 522, whereas the proposed method yields the third largest number, 191. According to Table 3, MCL achieves the highest accuracy of 0.361 on the Krogan dataset, and ClusterONE achieves the highest accuracy of 0.415 on the Collins dataset.

Second, we compare the six methods on the Gavin dataset. As shown in Table 3, the results on the Gavin dataset are similar to those on the Krogan dataset. IK-medoids detects 654 protein complexes and achieves the highest precision and sensitivity values, 0.706 and 0.652, respectively. ClusterONE achieves the highest F-measure of 0.526. RFC detects 153 protein complexes and achieves the highest accuracy of 0.375.

Third, we compare the six methods on the Collins dataset. Our method identifies 357 protein complexes and achieves the highest sensitivity of 0.610. RFC predicts 108 protein complexes and achieves the highest precision and F-measure, 1.000 and 0.695, respectively. The precision, F-measure, and accuracy achieved by the proposed method on this dataset are not competitive. Finally, the results of the six methods on the same PPI networks with the SGD gold standard dataset are shown in Table 4. The results using the SGD gold standard dataset are similar to those using MIPS. IK-medoids achieves comparable results on the Krogan and Gavin datasets and exhibits weaker performance on the Collins PPI network.

As Tables 3 and 4 show, our proposed method outperforms RFC on most datasets. This illustrates the effectiveness of our method and further suggests that it is inappropriate to regard the detected non-overlapping substructures directly as protein complexes, because doing so discards some proteins that truly belong to complexes.

4 Conclusion

In this paper, we proposed the IK-medoids method, which improves the K-medoids clustering algorithm using rough fuzzy relations and the biological meaning hidden in protein pairs. Moreover, a new fitness function was proposed to measure the distance between two proteins. The experimental results showed that the IK-medoids algorithm achieves satisfactory performance compared with existing approaches. Using upper and lower approximations, we detected several overlapping protein complexes that match the gold standard datasets, which effectively improved the accuracy of the identification algorithm.
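As a rough illustration of how a protein can end up in more than one complex, the following sketch applies the rule stated in the abstract: the similarity between a protein and each candidate complex is compared with the maximum similarity, and the ratio decides membership. The function name, the threshold value of 0.8, the direction of the comparison, and the similarity scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def assign_complexes(similarities: Dict[str, float],
                     ratio_threshold: float = 0.8) -> List[str]:
    """Return every candidate complex whose similarity to the protein is at
    least `ratio_threshold` times the maximum similarity.

    The threshold and the comparison direction are assumptions made for
    this sketch.
    """
    if not similarities:
        return []
    best = max(similarities.values())
    if best <= 0:
        return []  # the protein is not similar to any candidate complex
    return sorted(name for name, score in similarities.items()
                  if score / best >= ratio_threshold)


# Hypothetical similarity scores between one protein and three complexes.
scores = {"C1": 0.9, "C2": 0.75, "C3": 0.3}
members = assign_complexes(scores)
print(members)
```

With these scores the ratios are 1.00, 0.83, and 0.33, so the protein is shared by C1 and C2. In rough-set terms, a protein assigned to a single complex would sit in that complex's lower approximation, while a shared protein would sit only in the upper approximations of the complexes involved.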

To demonstrate the utility of our method, we compared IK-medoids with ClusterONE, CMC, MCL, OSLOM, and RFC on three yeast PPI networks. The experimental results showed that our method outperforms the other five state-of-the-art methods on most datasets in terms of the considered metrics. However, IK-medoids must analyze the whole PPI network to re-divide clusters, which makes it unsuitable for more complex networks. Therefore, in future work we will study the detection of overlapping protein complexes in highly dense PPI networks using rough fuzzy relations.

Acknowledgments

The authors would like to acknowledge the assistance provided by the National Natural Science Foundation of China (Grant nos. 61572180, 61472467, 61471164, 61672011 and 61602164), the Hunan Provincial Natural Science Foundation of China (Grant nos. 13JJ2017 and 2016JJ2012), and the Key Project of the Education Department of Hunan Province (Grant no. 17A037).

Table 1
Summary of the Three Datasets Used in Our Study

DataSet   Proteins   Interactions   Density
Krogan    2708       7123           0.002
Gavin     1855       7669           0.004
Collins   1622       9074           0.007
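
The Density column of Table 1 can be checked with a short calculation. The sketch below assumes the standard density of an undirected simple graph, 2m / (n(n - 1)) for n proteins and m interactions; the function name is illustrative, and the paper's own definition is not shown in this part of the text.

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: 2m / (n * (n - 1))."""
    if num_nodes < 2:
        return 0.0  # density is undefined for fewer than two nodes
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# (proteins, interactions) as listed in Table 1.
datasets = {
    "Krogan": (2708, 7123),
    "Gavin": (1855, 7669),
    "Collins": (1622, 9074),
}

densities = {name: round(graph_density(n, m), 3)
             for name, (n, m) in datasets.items()}
print(densities)
```

Rounded to three decimals, this assumed formula reproduces the tabulated densities of 0.002, 0.004, and 0.007.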