Abstract
Abstract
Biclustering is a process of finding groups of genes that behave similarly under a subset of conditions. In this article, we propose an efficient biclustering algorithm, namely RN+, to identify biologically meaningful biclusters in gene expression data. The RN+ algorithm finds biologically meaningful biclusters through a novel gene filtering using protein–protein interaction network, gene searching, gene grouping, and queuing process. It also efficiently removes duplicate biclusters. We tested the proposed RN+ on five real microarray datasets, and compared its performance with seven competitive biclustering algorithms. The experimental results show that RN+ efficiently finds functionally enriched and biologically meaningful biclusters for large gene expression datasets, and outperforms the other tested biclustering algorithms on real datasets.
1. Introduction
Microarray data are a two-dimensional matrix of gene expression levels measured under different conditions (Saullo et al., 2015). The data can be viewed as an m × n matrix with m genes (rows) and n conditions (columns), in which each entry represents the expression level of a gene under each condition (Wang et al., 2016). For many analyses, it is important to find clusters of genes that have the same function. Clustering has been an important tool for analyzing microarray data. Conventional clustering methods such as K-means have found identified genes or samples that are functionally related under specific conditions (Li et al., 2012). Some genes are not coexpressed under all conditions and may be similar only in a subset of experimental conditions (Li et al., 2012).
Biclustering is a process of identifying groups of genes that behave similarly under a specific subset of conditions. However, finding biclusters in microarray data is difficult because the number of potential biclusters increases as the number of genes and the number of samples increase (Bhattacharya and Cui, 2017). Biclustering has been proven to be a non-deterministic polynomial time (NP)-hard problem (Ahn et al., 2011). Many biclustering algorithms therefore use heuristic methods, and their solutions may be suboptimal (Ahn et al., 2011). There are advantages and disadvantages to each algorithm, depending on the pattern of biclusters. This article proposes an algorithm, RN+, which is a variant of the RN clustering algorithm described in Ahn et al. (2011), and which tries to find as many functionally associated biclusters as possible. We tested and compared RN+ with seven competitive biclustering algorithms on real datasets. The results show that RN+ outperformed other biclustering algorithms in terms of both the number of enriched biclusters and the percentage of biologically meaningful biclusters.
2. Related Work
Existing biclustering algorithms can be divided into several categories. The first category is pattern-based biclustering. This approach tries to find strictly additive or multiplicative patterns. However, many of pattern-based biclustering algorithms are exponential with respect to the size of the microarray data or the noise level of the microarray data (Xu et al., 2006). Therefore, this approach is not practical for large datasets. Moreover, these pattern-based algorithms tend to find too many biclusters that are only slightly different. The second category is tendency-based biclustering, algorithms that identify Order-Preserving Submatrices (OPSMs) and use them to find biclusters (Ahn et al., 2011). The OPSM belongs to this category (Ben-Dor et al., 2003). It was reported that the OPSM has shown good gene ontology (GO) validation results (Ahn et al., 2011). However, the OPSM finds only one bicluster with the best results every time it is executed, so multiple biclusters may not be identified. This algorithm also risks missing hidden patterns in microarray data (Ahn et al., 2011).
Another category is a divide-and-conquer algorithm. These algorithms divide a problem into small subproblems, solve each subproblem recursively, and then combine subsolutions into one solution. The Binary Inclusion-Maximal Biclustering Algorithm (Bimax) is an example of this category. It finds all submatrices whose entries are equal to 1 in a binary matrix. It has been reported that Bimax performs well on certain patterns, such as upregulated biclusters (Eren et al., 2012). Another algorithm, Statistical-Algorithmic Method for Bicluster Analysis (SAMBA), employs a statistical model of biclusters and develops combinatorial methods for biclustering large datasets. The SAMBA enumerates all the possible biclusters and showed robustness with respect to noise. Another algorithm, Qualitative Biclustering (QUBIC), employs statistical models to find all statistically significant biclusters. In addition, the iterative signature algorithm (ISA) (Bergmann et al., 2003) and Sequential row-based biclustering algorithm for analysis of gene expression data (UniBIC) (Yun and Yi, 2013) biclustering algorithms have been described for finding biclusters in the microarray data.
3. Method
3.1. Notations
In this article, we use the following terminologies:
g0, g1,… gm-1: genes
s0, s1, …. sn-1: samples
p-BCS: bicluster candidate set (BCS) consisting of p samples
deg_th: user-specified degree threshold
dist_th: user-specified distance threshold
edge_th: user-specified edge threshold
mns: user-specified minimum number of samples in BCS
mng: user-specified minimum number of genes in BCS
st: user-specified similarity threshold between two BCSs
3.2. RN+ algorithm
The RN+ algorithm is composed of five main steps:
Gene filtering using protein–protein interaction (PPI) network data Find 2-BCS (bicluster candidate set); Obtain (p + 1)-BCSs from p-BCSs; Single priority queuing; Remove duplicate BCSs
3.2.1. Step 1: Gene filtering using gene network data
The goal of our first step is to find genes that are closely related to each other using the PPI network data. PPI network data can be represented as a graph. Each node represents a gene, and nodes are connected by undirected edges. We set the weight of the edges to 1. Step 1 is composed of three substeps: (1) remove genes with a high degree, (2) create subgraphs (gene sets) for each gene and remove duplicate gene sets, and (3) remove relatively sparse subgraphs, which contain few edges. First, we remove high-degree genes from the graph, since genes that are not highly related may have a short distance from other genes due to the presence of high-degree genes. The distance between two genes is the number of edges in the shortest path between them. Let the degree of a gene v and the total number of genes (vertices) in the graph be deg(v) and M, respectively. We remove edges of high-degree genes, which do not satisfy the following equation:
Second, to find a dense subgraph for each gene, we set a parameter called dist_th. We visit all genes and construct subgraphs for each gene, whose distance from the current gene is less than dist_th. When we create subgraphs by going through all the genes in this way, the number of subgraphs becomes equal to the number of genes, and there are many subgraphs that are similar to each other. We then remove duplicate subgraphs. If the overlapping ratio of the two subgraphs intersecting each other exceeds st, then the subgraph with the smaller size is removed to eliminate duplicate gene sets. For example, if st is 0.7 and two subgraphs are {g3, g4, g9, g13} and {g3, g4, g9}, the overlapping ratio is
The presence of many edges in any of the remaining subgraphs indicates that the genes belonging to the subgraph are highly related. The maximum number of edges in the ith subgraph is
Figure 1 shows a toy example. An initial graph with 13 genes is shown in Figure 1a. For this example, we use the following parameters: deg_th = 0.5, dist_th = 2, and edge_th = 0.4, st = 0.7, and mng = 3. First, the edges of g1 are removed since it does not satisfy (1). The output graph after removing a large-degree gene, in this case g1, is shown in Figure 1b. Then, a total of 13 subgraphs for each gene are generated. Three subgraphs, {g8, g12}, {g3, g4, g9, g13}, and {g5, g6, g7, g10, g11}, remain after removing duplicate gene sets and, additionally, {g8, g12} is removed, since the number of genes comparing it is less than mng (Fig. 1c). Finally, only {g5, g6, g7, g10, g11} remains, and {g3, g4, g9, g13} is removed since it does not satisfy (2).

Toy example of gene filtering using protein–protein interaction network data when deg_th = 0.5, dist_th = 2, edge_th = 0.5, st = 0.7, and mng = 3.
3.2.2. Step 2: Find 2-BCS
If the number of samples in the microarray data is n, extract all possible two-sample pairs from these samples to generate a 2-BCS for all genes, which passed step 1. So, a p-BCS is a small bicluster candidate with p-samples for being expanded to a larger bicluster. If the expression values in other samples are the same as those of a specific gene, these sample pairs are excluded when making a 2-BCS. This approach creates a maximum number of possible cases of C (n, 2) when generating 2-BCSs. However, any 2-BCS that cannot meet the mns condition, which is specified by the user, is not created. For example, if there are four samples (n = 4) in a microarray dataset, s0, s1, s2, and s3, then a total of six 2-BCSs are possible, {s0, s1}, {s0, s2}, {s0, s3}, {s1, s2}, {s1, s3}, and {s2, s3}. However, if mns = 3, {s1, s3}, {s2, s3} that cannot be further expanded are not generated.
3.2.3. Step 3: Obtain (p + 1)-BCSs from p-BCSs
When expanding from p-BCSs to (p + 1)-BCSs, we select the last sample of each p-BCS sample set as the sample to be expanded. For example, when making 3-BCS from 2-BCS, if the samples in 2-BCS are {s1, s2}, then the last sample is s2, so we select s3 as the next sample to expand. That is, (p + 1)-BCS is obtained from p-BCS using breadth-first search (BFS) as shown in Figure 2.

Obtaining (p + 1)-BCSs from p-BCSs using breadth-first search. p-BCS, bicluster candidate set consisting of p samples.
When a (p + 1)-BCS is obtained from a p-BCS, genes having similar patterns of expression values are grouped together to form the (p + 1)-BCS. The similarity between them is measured by the following
Here,
Specifically, we divide the
Figure 3 shows an example of this gene grouping process when δ is 2. First, we multiply the lowest

Example of gene grouping process when δ = 2.
3.2.4. Step 4: Queuing
When we expand (p + 1)-BCSs from p-BCSs, many duplicate gene sets may exist. Due to memory issues, we cannot keep all BCSs in memory. Priority queuing is used to solve this problem. We prioritize genes with large gene sets, since our goal is to find large biclusters with diverse memberships. Unlike Ahn et al. (2011), we used a single, rather than a multiple, priority queue, since preliminary experiments showed that a large number of gene sets in BCSs overlap in different queues. When multiple priority queuing is used as in Ahn et al. (2011), both small and large size gene sets may coexist in each queue. But due to these large gene sets, when duplicate BCSs are removed from each queue (step 4), the smaller size gene sets are removed first from the queue and as a result, only the duplicate gene sets remain in each queue. This approach eliminates the chance that small gene sets, which are removed from this queue, could expand to a more diverse set of genes. Therefore, we use a single priority queue and remove redundant large size gene sets through duplicate BCS removal process (step 4), with the result that diverse gene sets remain in the queue.
Figure 4 shows an example of why single priority queuing is preferable to multiple priority queuing. In this example, a similarity threshold between two BCSs specified by the user, (st), is set to 0.6. When multiple priority queues are used as shown in Figure 4a, the gene set {g5, g6, g7} is removed from the first queue, because the overlapping ratio of the gene set, {g5, g6, g7}, is 1, which is larger than the st value. But the gene set {g1, g2, g3, g4, g5, g6, g7} will remain in priority queue 1. Similarly, in priority queue 2, the gene set {g6, g7, g8, g9, g10} is removed because the overlapping ratio (i.e., 0.6) is equal to the st value. Therefore, the gene set {g2, g3, g4, g5, g6, g7, g8} remains in priority queue 2. However, the two gene sets {g1, g2, g3, g4, g5, g6, g7} and {g2, g3, g4, g5, g6, g7, g8} remaining in each queue are very similar. If we use a single priority queue as shown in Figure 4b, one of the two gene sets, {g1, g2, g3, g4, g5, g6, g7} and {g2, g3, g4, g5, g6, g7, g8}, will be removed. Also, gene set {g6, g7, g8, g9, g10} is not removed, since its overlap ratio, 0.4, is smaller than the st value (0.6).

Comparison of two priority queuing strategies when st = 0.6.
3.2.5. Step 5: Remove duplicate BCSs
In the priority queue, there exist many BCSs with duplicate gene sets. The process of removing the duplicate BCSs is as follows: first, the BCS with the highest priority (the BCS with the largest gene set) is compared with the rest of the BCSs in the priority queue. If the overlapping ratio of the two BCSs intersecting each other exceeds st, then the BCS with the smaller size is removed. Second, the BCS with the second highest priority among the remaining BCSs is compared with the rest of the BCSs, and if the overlapping ratio of the two BCSs intersecting each other exceeds st, then the BCS with the smaller size is removed. This process is repeated until the BCS with the lowest priority has been handled.
In Ahn et al. (2011), there is an additional step of removing duplicate BCSs using a bit string after performing the above duplicate BCS removal process. First, the algorithm finds the gene with the highest index among the remaining genes and creates a bit string of that size whose bits are all false (0). Then, from the highest priority gene set in the queue, we set the ith bit of the bit string to true (t) if the ith bit is false, and the gene set includes ith gene. Let the ratio be
In this work, we do not use this duplicate BCS removal process using a bit string, since it could remove a gene set that will expand to a unique bicluster. Figure 5 shows one example when st is 0.6. Suppose there are three gene sets, {g1, g2, g3, g4, g5}, {g8, g9, g10, g11, g12}, and {g5, g6, g7, g8}, in the priority queue. Then, when the duplicate BCS removal process using the bit string is used, gene set {g1, g2, g3, g4, g5} and gene set {g8, g9, g10, g11, g12} corresponding bits are changed to 1 in the bit string and retained (Fig. 5). And the gene set, {g5, g6, g7, g8}, is removed, because its ratio is 2/4, which is smaller than the st value (0.6). However, the removed gene set, {g5, g6, g7, g8}, overlaps only a small portion of other two gene sets.

Elimination of duplicate BCSs when st = 0.6.
4. Results
4.1. Datasets
To evaluate the effectiveness of the RN+ algorithm, a total of five real microarray datasets, GDS181, GDS589, GDS1027, GDS1406, and GDS1490, were used (Padilha et al., 2017). The datasets include Homo sapiens, Rattus norvegicus, and Mus musculus (Table 1). We also downloaded PPI network data from BioGRID (Oughtred et al., 2019) for H. sapiens, R. norvegicus, and M. musculus. The detailed information is shown in Table 2.
Description of Gene Expression Datasets
Description of Protein–Protein Interaction Datasets
4.2. Parameters
The following parameters were used: the deg_th is used to identify high-degree genes and is set to 0.01. The dist_th is set to 2. The edge_th is set to 0.001. The minimum number of samples in the BCS specified by the user, mns, is set to 4. The mng is the minimum number of genes and is set to 30. The st value represents degree of overlap between biclusters. We set st = 0.7. When the st value increases, more biclusters can be found since the parameter allows bicluster overlaps with other biclusters. Queue size was set to 1000. The δ value is a user-specified parameter that determines how much of the t-value difference in the same BCS should be tolerated. The δ value is set to 1.8 to find as many diverse biclusters as possible.
4.3. Computational cost of RN+
Let the number of genes and samples in microarray data be m and n, respectively. The computational cost of the first step, which finds 2-BCSs, is O(n2) since there are C(n,2) sample pairs. The second step is to find (p + 1)-BCSs from p-BCSs. For each BCS, at most n samples need to be examined and there are at most m groups during the gene grouping process. Also, we need to sort the genes with respect to s values. Thus, it takes O(nmlogm) when quick sort is used. Obtaining all (p + 1)-BCSs from all p-BCS occurs at most n times. Therefore, in total the RN+ algorithm approximately takes O(n2mlogm).
We implemented the RN+ biclustering algorithm using C++. We calculated the running time of RN+ by varying the number of rows (genes) while fixing the number of columns (samples) as 50. The RN+ algorithm is tested on a computer with 3.2 GHz quad-core Intel i5-6500 CPUs and 8 GB main memory. Figure 6 shows the running time of the RN+ algorithm with respect to the number of rows. We observe an almost linear relationship as the number of genes increases.

Running time of the RN+ with increasing number of genes when the number of columns (samples) is 50.
4.4. Comparison
A total of seven state-of-the-art biclustering algorithms were compared for the experiments: RN clustering (Ahn et al., 2011), Cheng and Church (CC) (Cheng and Church, 2000), ISA (Bergmann et al., 2003), OPSM (Ben-Dor et al., 2003), Plaid (Lazzeroni and Owen, 2000), QUBIC (Li et al., 2009), SAMBA (Tanay et al., 2002), and UniBIC (Wang et al., 2016). The author of the RN clustering algorithm provided the source code (Ahn et al., 2011), which was implemented in C++ (Table 3).
Biclustering Algorithms and Implementations Used in Our Experiments
CC, Cheng and Church; ISA, iterative signature algorithm; OPSM, Order-Preserving Submatrix; QUBIC, Qualitative Biclustering; UniBIC, sequential row-based biclustering algorithm for analysis of gene expression data.
4.5. GO evaluation
The number of biclusters found and enriched after filtering out the highly overlapped biclusters for each algorithm at a significance level of 5% is shown in Table 4. Biclusters are considered to be enriched if any GO term was smaller than p = 0.05 (Eren et al., 2012). The RN+ algorithm finds the largest number of biclusters compared with the other state-of-the-art biclustering algorithms we explored. The RN+ algorithm found 552 enriched biclusters from 561 discovered. It outperformed the RN algorithm in terms of both the enriched biclusters and the percentage of biclusters found. The ISA algorithm achieved 100% enriched proportion but found only 30 biclusters on five real datasets. The CC algorithm found 493 biclusters, but most of them are not biologically meaningful. Similarly, UniBIC found 495 biclusters, but only 68.2% of them are enriched. Figure 7 shows the proportions of the enriched biclusters for each algorithm at five different significance levels. The results of five real datasets are aggregated. We observed that the RN+ algorithm outperformed the RN algorithm at all significance levels. The ISA algorithm showed 100% proportion but only found 30 biclusters at all significance levels.

Proportion of the enriched biclusters for various biclustering algorithms on five different significance levels (p). CC, Cheng and Church; ISA, iterative signature algorithm; OPSM, Order-Preserving Submatrix; QUBIC, Qualitative Biclustering; UniBIC, sequential row-based biclustering algorithm for analysis of gene expression data.
Accumulated Number of Biclusters and Enriched Biclusters on Five Real Datasets at a Significance Level of 5%
5. Conclusion
In this article, we have proposed a novel biclustering method, RN+, for finding biologically meaningful biclusters in microarray data. The RN+ algorithm identified biologically significant biclusters by performing a unique gene filtering, gene searching, gene grouping, and queuing process, using PPI network data. Unlike previous approaches, the RN+ method performs biclustering only for highly related genes in the PPI network using the gene filtering process. It also uses a single priority queuing scheme to effectively construct a breadth-first tree, and removes redundant BCSs to identify diverse biclusters. We compared the RN+ algorithm with several competitive biclustering algorithms on five real datasets. The GO database was used to validate the biological significance. Experimental results show that RN+ was able to find a significantly larger number of functionally enriched and biologically meaningful biclusters compared with those of competitive biclustering algorithms.
Footnotes
Acknowledgment
This research was supported by the Incheon National University Research Grant in 2018.
Author Disclosure Statement
The authors declare that no competing financial interests exist.
