Abstract
Abstract
Biclustering, which performs simultaneous clustering of rows (e.g., genes) and columns (e.g., conditions), has been shown to be important for analyzing microarray data. To find biclusters, there have been many methods proposed. Most of these methods can find only clusters with coregulated patterns, which means that the expression levels of genes in a found cluster rise and fall simultaneously. However, for real microarray data, there exist negative-correlated patterns, which means that the tendencies of expression levels of some genes may be completely inverse to those of the other genes under some conditions. Although one method called Co-gclustering was proposed to simultaneously find clusters with correlated and negative-correlated patterns, its time complexity is exponential to the number of conditions, which may not be efficient. Therefore, in this article, we propose a new method, Up-Down Bit pattern (UDB), to efficiently find clusters with correlated and negative-correlated patterns. First, we utilize up-down bit patterns to record those condition pairs where one gene is upregulated or downregulated. One gene is upregulated (or downregulated) under condition pair a and b if its expression level shows an upward (or downward) tendency from condition a to condition b. Then, we apply a heuristic idea on these up-down bit patterns to efficiently find clusters, which will reduce the time complexity from exponential time to polynomial time. From the experimental results, we show that the UDB method is more efficient than the Co-gclustering method.
1. Introduction
Traditional clustering methods work in the full dimensional space, and can only be applied to either the rows or the columns of a data matrix, separately (Aguilar-Ruiz, 2005; Zhao and Zaki, 2005). However, investigations show that more often than not, several genes contribute to the same pathway, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions (Yang et al., 2005). Moreover, for traditional clustering methods, one object (i.e., one gene) is commonly assigned to one cluster. In fact, genes may participate in several biological functions and thus should be included in multiple clusters (Ihmels et al., 2004). Therefore, biclustering (Cheng and Church, 2000) was proposed, which does not have those limitations. Biclustering performs simultaneous clustering of rows and columns. If some objects (genes) are similar under some conditions (i.e., a subspace), they will be clustered together in that subspace. Biclustering has proved of great value for finding interesting patterns from microarray data (Zhao and Zaki, 2005).
There have been several types of biclusters proposed (Madeira and Oliveira, 2004). Among these types, biclusters with coherent values define a bicluster as a subset of genes and a subset of conditions of the microarray data which have coherent values of expression levels on both rows and columns (Madeira and Oliveira, 2004). Figure 1 shows an example, where Figure 1a shows the expression patterns of 6 genes of one microarray data matrix on the full space, i.e., all conditions, and Figure 1b shows a bicluster of 2 genes with coherent values on only the subspace, i.e., conditions a, b, c, and e. Methods for biclusters with coherent values (Chang et al., 2009; Cheng and Church, 2000; Wang et al., 2002; Yoon et al., 2005; Zhao and Zaki, 2005) aim to cluster those genes whose expression values have simple linear transformation relationships. Genes in such a bicluster have been shown to have significant biological relationships among these genes (Zhao and Zaki, 2005). However, the definition of biclusters with coherent values is sometimes too strict, since it only allows a simple linear transformation among the expression values of genes. For some microarray data matrices, it may be very difficult to find such a bicluster. Another type of biclusters, biclusters with coherent evolutions, does not have such a limitation. This type of biclusters aims to find a subset of genes up-regulated or down-regulated across a subset of conditions without taking into account their actual expression values (Madeira and Oliveira, 2004). Figure 2 shows an example of one bicluster of 3 genes with coherent evolutions for the expression data shown in Figure 1a.

An example of biclustering.

A bicluster of three genes with coherent evolutions.
In Zhao et al. (2008), they further considered the negative-correlated pattern. That is, the tendencies of expression values of some genes are completely inverse to those of the other genes. Figure 3 shows an example of the negative-correlated pattern for the expression data shown in Figure 1a, where the top gene shows an inverse tendency of expression values against the other three genes. There have been many genes proved to have such behavior. For example, for genes of the yeast, genes YLR367W and YKR057W share the same tendency, while they have a negative-correlated pattern against gene YML009C under 8 conditions (Zhao et al., 2008). It has been suggested that all these genes are involved in protein translation and translocation and should be grouped into the same cluster (Breitkreutz et al., 2006; Zhao et al., 2008). In this article, we will focus on the design of an efficient method for finding biclusters containing both correlated and negative-correlated patterns.

A bicluster of four genes containing both correlated and negative-correlated patterns.
According to the definition of a up/down regulation, methods of finding biclusters with coherent evolutions could be classified into two types: one focusing on the expression value of one gene under each single condition, and the other one focusing on the variation of expression values of one gene from one condition to another condition. For methods of the first type, e.g., SAMBA (Tanay et al., 2002) and BiModule (Okada et al., 2007), they usually define that one gene is up/down-regulated under one single condition if its expression level after standardizing with the mean (e.g., 0) and the variance (e.g., 1) is above/below a certain value (e.g., 1/−1) (Tanay et al., 2002). These methods aim to cluster those genes whose expression levels are commonly up/down-regulated under some conditions. However, they do not consider the relationship between expression levels of two conditions for one gene, which may not be precise enough occasionally. For example, these methods may find that both genes 1 and 2 are upregulated under both condition a and condition b. However, the expression levels of gene 1 from condition a to condition b may be increasing, while the expression levels of gene 2 from condition a to condition b may be decreasing. For methods of the second type, e.g., OPSM (Ben-Dor et al., 2003), OP-Cluster (Liu and Wang, 2003), and Co-gclustering (Zhao et al., 2008), they take into consideration the variation of expression values of one gene from one condition to another condition. In Ben-Dor et al. (2003), a submatrix of the expression data is an OPSM cluster if there is a permutation of its columns under which the sequence of values in every row is strictly increasing. The OP-Cluster method (Liu and Wang, 2003) also tries to find the OPSM clusters by transforming this problem into a sequential pattern mining-like problem. These methods consider the numeral order of expression levels between every two conditions, and aim to cluster those genes whose expression levels are increasing under one permutation of conditions. Although all the above methods may efficiently find clusters with correlated patterns, they can not find clusters with negative-correlated patterns. Therefore, the Co-gclustering method (Zhao et al., 2008) was proposed. This method utilizes a tree structure to gradually generate all the answers, and can simultaneously find clusters with correlated and negative-correlated patterns.
Although the Co-gclustering method can find clusters with correlated and negative-correlated patterns, the tree constructing process of this method may not be efficient. Assuming that the number of conditions is n, in the worst case, this method needs to generate (n − 1) trees, where each tree represents one combination of k conditions for 2 ≤ k ≤ n. The time complexity of this method is O(2 n ). Even though the Co-gclustering method utilizes a technique which reduces the number of trees being generated, for real microarray data, the effect of this technique is still limited and could not significantly reduce its time complexity.
Therefore, in this article, we propose a new method, Up-Down Bit pattern (UDB), to efficiently find clusters with correlated and negative-correlated patterns. In the UDB method, first, we utilize up-down bit patterns to record those condition pairs where one gene is upregulated or downregulated. One gene is upregulated (or downregulated) under condition pair a and b if its expression level shows an upward (or downward) tendency from condition a to condition b. For each gene, its up bit pattern and down bit pattern are two bit strings with length
The rest of this article is organized as follows. In Section 2, we give a survey of the Co-gclustering method. Section 3 presents the proposed UDB method. In Section 4, we study the performance of the UDB method and make a comparison with the Co-gclustering method. Finally, we make a conclusion in Section 5.
2. Related Work
In this section, we introduce the Co-gclustering method (Zhao et al., 2008). In 2008, Zhao et al. proposed the Co-gclustering method to find biclusters containing both correlated and negative-correlated patterns. Given a microarray data matrix with m genes and n conditions, for each gene gi, this method first finds all significant p-pairs (prototypal sequence pairs), (cj, ck). Assume that the expression levels of gene gi under condition cj and ck are di,j and di,k, respectively. A significant p-pair can be either upregulated when
Then, this method constructs Co-tree T2 from T1. Two p-pairs (ci, cj) and (cj, ck) in T1 are concatenated to form a new 2-segment ((ci, cj), (cj, ck)) in T2, where this 2-segment represents a path from the root node to the leaf node, i.e., (ci, cj, ck), of T2. For genes stored in the leaf nodes, they are also divided into two clusters, the S-cluster and the D-cluster. The S-cluster for segment ((ci, cj), (cj, ck)) is the union of all genes in (the ↗-cluster of (ci, cj) ∩ the ↗-cluster of (cj, ck)) and (the ↘-cluster of (ci, cj) ∩ the ↘-cluster of (cj, ck)). The D-cluster for segment ((ci, cj), (cj, ck)) is the union of all genes in (the ↗-cluster of (ci, cj) ∩ the ↘-cluster of (cj, ck)) and (the ↘-cluster of (ci, cj) ∩ the ↗-cluster of (cj, ck)). For genes in the same cluster, this method further utilizes a dissimilarity threshold, δ, to verify the similarity of every two genes. If two genes can not satisfy this threshold, they are divided into two different clusters. After checking the dissimilarity threshold, this method also checks whether the number of genes of each cluster is not less than α, where α is the parameter of the minimal number of genes of a found cluster. Genes in the same cluster form a Co-gcluster if the number of conditions along the path from the root node to the leaf node is not less than β, where β is the parameter of the minimal number of conditions of a found cluster.
For the rest of Co-trees, Tl where l ≥ 3, they are constructed from those previous Co-trees. Two segments of a Co-tree can be concatenated together to form a new path in the new Co-tree, if the last p-pair of the former segment is the same as the first p-pair of the later segment. This technique is called β-jumping in the Co-gclustering method. For example, for two 3-segments of T3, ((c1, c2), (c2, c3),
The Co-gclustering method utilizes the β-jumping technique to avoid generating all Co-trees Tl for l ≤ n. However, for real microarray data, we have no knowledge about the possible number of conditions of a found cluster. For the Co-gclustering method, if parameter β is set to a large value close to the maximal possible number of conditions of found clusters, there may exist some interesting clusters being dropped due to this large value of β. If parameter β is set to a value much less than the maximal possible number of conditions of found clusters, this method needs to construct all Co-trees Tl for l ≥ β. The time complexity in the worst case is O(2 n ), i.e., enumerating the power set of n conditions. Therefore, this tree constructing process may not be efficient.
3. Proposed Method
In this section, we present our method, Up-Down Bit pattern (UDB), for finding clusters which contain both coregulated and negative-coregulated patterns in microarray data. Table 1 shows the variables used in the UDB method, and Figure 4 shows the UDB method. The UDB method has three main steps: (1) determining up-down bit patterns of each gene, (2) clustering those genes which share the same tendency based on up-down bit patterns, and (3) post-processing the found clusters. We will describe these steps in the following subsections.

The UDB method.
3.1. Step 1
In Step 1, we determine the up-down bit patterns for each gene. For each gene, we compute the difference of expression values of this gene under every two conditions (denoted as Cond1 and Cond2). If this difference is positive, it means that the gene is upregulated (i.e., an upward tendency) from condition Cond1 to condition Cond2. If this difference is negative, it means that the gene is downregulated (i.e., a downward tendency) from condition Cond1 to condition Cond2.
Formally, the same as the definition in Zhao et al. (2008), we also have a threshold,

An example gene expression matrix.
We utilize up-down bit patterns to record the condition pairs where gene i is upregulated or downregulated. The up and down bit patterns of gene i are denoted as UDBPi.Up and UDBPi.Down, respectively. UDBPi.Up and UDBPi.Down are initially two bit strings with all bits equal to 0. One bit of UDBPi.Up (or UDBPi.Down) is set to 1, if gene i is significantly upregulated (or downregulated) under the condition pair corresponding to that bit. As shown in Step 1 of Figure 4, we use an increasing number, BitNum, to record the corresponding bit number for condition pair Cond1 and Cond2. The value of BitNum is one-to-one mapping to one combination of Cond1 and Cond2. Assume that the number of total conditions is n. The lengths of UDBPi.Up and UDBPi.Down are both
Figure 6 shows the up-down bit patterns for the example shown in Figure 5, where Figure 6a shows the differences of expression values between each condition pair, and Figure 6b shows the up-down bit patterns of each gene. The value of

Evaluation of up-down bit patterns: (a) the differences of expression values between each condition pair; (b) the up-down bit patterns of each gene.
3.2. Step 2
In this step, we cluster the similar genes according to their up-down bit patterns determined in Step 1. As shown in Step 2 of Figure 4, for each gene i, we first group it as a new cluster. Then, for each of the rest of genes except gene i, denoted as gene j, it is similar to gene i, if it satisfies the following Rule 1 or Rule 2, where “|one bit string|” means the number of on-bits (“1”) in this bit string:
In this case, we add gene j to the cluster of gene i.
For Rule 1, the term “UDBPi.
Similarly, for Rule 2, it considers the negative-coregulation. That is, when gene i is upregulated (or downregulated), gene j will be downregulated (or upregulated). Therefore, we consider “UDBPi.
In Step 2 of Figure 4, based on gene i, all the other genes except gene i will be checked whether they can be added into the cluster of gene i. Then, we check the number of genes within this cluster. Only if this number is above a user specified parameter, NR, this cluster is added into the result set of clustering, Ans. With the similar process, we continue to try to generate a new cluster for the next gene i in G until all genes in G are processed.
Biclustering is originally a NP-Complete problem (Aguilar-Ruiz, 2005; Yang et al., 2005; Yoon et al., 2005; Zhao and Zaki, 2005). Therefore, in the worst case, the time complexity of biclustering methods (Zhao et al., 2008) is O(2 n ), where n is the number of conditions. In Step 2 of the proposed UDB method, instead of enumerating the power set of n conditions (which is often used in those previous methods), we utilize a heuristic idea, i.e., two loops with time complexity O(m2), to speed up the process of finding clusters, where m is the number of genes. The benefit is that we reduce the time complexity from exponential time to polynomial time. Our pay is that not every two genes in a found cluster will satisfy the similarity threshold, Sim. For example, assume that we have three genes, e.g., genes 1, 2, and 3. The up bit patterns for genes 1, 2, and 3 are “11111,” “11110,” and “01111,” respectively. The down bit patterns for these genes are all “00000.” If the similarity threshold, Sim, is 0.8, the correct clusters should be {gene 1, gene 2} and {gene 1, gene 3}. In our Step 2, gene 2 and gene 3 will be added into the cluster of gene 1, since “|11111 AND 11110| = 4 ≥ 0.8 × 5” and “|11111 AND 01111| = 4 ≥ 0.8 × 5.” Next, gene 1 is added into the cluster of gene 2. Then, gene 1 is added into the cluster of gene 3. Finally, we find three clusters, {gene 1, gene 2, gene 3}, {gene 1, gene 2}, and {gene 1, gene 3}. The second and the third clusters are the same as the correct ones. Genes in the first cluster do not all satisfy the similarity threshold, since gene 2 (where UDBP2.Up = “11110”) and gene 3 (where UDBP3.Up = “01111”) are not similar to each other based on Sim = 0.8. To redeem this problem, we will post-process the found clusters in Step 3 later.
3.3. An improved version of Step 2
For Step 2 described in the previous subsection, we do not record the condition pairs of one cluster where genes in this cluster are coregulated or negative-coregulated under these condition pairs. In this subsection, we present an improved version, Step 2*, which records the condition pairs.
Figure 7 shows the new process of Step 2*. The basic idea of Step 2* is the same as that of Step 2, which also utilizes two loops with time complexity O(m2). One difference is that for each cluster, Clust, it contains the following two fields: (1) GeneSet, a set for storing the genes, and (2) CondBit, a bit string for indicating those condition pairs where genes in GeneSet are coregulated or negative-coregulated. Another difference is that we use two bit strings, UpBit and DownBit, to help us derive CondBit of one cluster. UpBit and DownBit are used to record those condition pairs where genes in this cluster are upregulated and downregulated, respectively. Note that one bit can keep to be “1” after AND operations only if this bit is “1” in all bit strings being applied with AND operations. By using this idea, we can utilize AND operations to find those on-bits occurring in the up-down bit patterns of all genes in one cluster.

The new Step 2*.
In Figure 7, initially, we create a new cluster, Clust. Then, for each gene i, we first add it to Clust.GeneSet. UpBit and DownBit are set to UDBPi.Up and UDBPi.Down, respectively. For each of the other genes, denoted as gene j, if it satisfies Rule 1 with gene i, we add gene j to Clust.GeneSet. Moreover, we update
Take the up-down bit patterns shown in Figure 6b as an example. Assume Sim = 0.9. First, we create a new cluster, Clust, and add gene 1 to Clust.GeneSet.
3.4. Step 3
In the previous step, we have found the clustering result stored in Ans according to parameters Sim (the similarity threshold) and NR (the minimal number of genes). In Step 3, we further present two post-processes for the found clusters which are to merge those similar clusters and to compute the similarity score of each cluster, respectively.
The process of Step 3 is shown in Figure 8. First, for every two clusters c1 and c2, if their gene sets overlap with each other by more than (MT × 100) percent, we merge them into a new cluster, where MT is the user-specified merging threshold and 0 ≤ MT ≤ 1. The purpose of this process is to reduce the number of clusters which may be redundant. Since the input data may be noisy, finding too many clusters with large overlaps only makes it difficult for the users to select those important clusters (Zhao and Zaki, 2005). In this case, the gene set of the new cluster is the union result of c1.GeneSet and c2.GeneSet. The bit string of condition pairs of the new cluster is the AND result of c1.CondBit and c2.CondBit. This process is optional for users. When MT is set to 1, the merging process will not be performed.

Step 3.
Next, we compute the similarity scores for clusters stored in Ans. For cluster x, its similarity score, SC(x), is evaluated by the following Equation 1:
In other words, the similarity score is the percentage of on-bits in x.CondBit. This is because each on-bit represents one condition pair where all genes in x.GeneSet are coregulated. The more similar the genes of cluster x are to each other, the larger the number of on-bits in x.CondBit is.
After evaluating the similarity scores, we can provide a ranking for these clusters according to their similarity scores. There may exist clusters whose similarity scores are less than the similarity threshold, Sim. We call clusters whose similarity scores are not less than Sim as true clusters, and clusters whose similarity scores are less than Sim as potential clusters. The potential clusters are generated due to any of the following two processes: (1) Step 2/2* and (2) merging. The reason for the first process (i.e., Step 2/2*) has been mentioned previously. That is, we apply a heuristic method with time complexity O(m2) to find the clusters. Therefore, we may find some clusters whose genes do not completely satisfy the similarity threshold. For such clusters, instead of dropping them, we output them as potential clusters. This is because such clusters may still contain some biological meanings, although not every two genes within them satisfy the user-specified similarity threshold, Sim. The larger the number of genes in a found cluster which do not satisfy the threshold, the lower the similarity score of this cluster. Therefore, in Step 3 of the UDB method, we redeem the problem of those clusters which are generated in Step 2/2* and do not satisfy the threshold by outputting them as potential clusters. The reason of generating potential clusters for the second process (i.e., merging) is that even if two clusters before merging satisfy the similarity threshold, there is no guarantee that the new cluster after merging will satisfy the similarity threshold. Therefore, although the merging process could reduce the number of redundant clusters, the number of true clusters may also decrease, i.e., a tradeoff.
4. Performance
In this section, we study the performance of the UDB method. We implemented the UDB method in Java, and performed all experiments on a Fedora Linux virtual machine (1024 MB of memory, an 2.4 GHz Intel processor) over Windows XP through VMware middleware.
4.1. Simulation results on synthetic data sets
In this subsection, we study the efficiency of the proposed UDB method by several synthetic data sets. Table 2 shows the related parameters for generating these synthetic data sets. Each synthetic data set is simulated by a 2-dimensional matrix with M rows and N columns. Then, X Co-gclusters with size (EM rows × EN columns) and
The default parameters for the Co-gclustering method and the UDB method are described as follows. The dissimilarity threshold in the Co-gclustering method is set to ∞ to find the same answers as the UDB method. The parameters of the minimal number of genes of a found cluster, i.e., α in the Co-gclustering method and NR in the UDB method, are set to the same value as EM. For the Co-gclustering method, the parameter of the minimal number of conditions of a found cluster, β, is set to EN. Since there does not exist such a parameter in the UDB method, we set the value of parameter Sim to (
For the data set of Case 1, we study the relationship between the processing time and the number of genes in a microarray data set, and make a comparison between the Co-gclustering method and the UDB method. Figure 9 shows the simulation result for the data set of Case 1. From this figure, we could observe that the processing time of the UDB method increases as the number of genes (i.e., M) increases, since the time complexity of the UDB method is O (M2). Moreover, the UDB method needs shorter processing time than the Co-gclustering method for this data set. The reason is that the number of conditions greatly affects the processing time of the Co-gclustering method. In this simulation, such an effect makes the Co-gclustering method need longer processing time than the UDB method even when the number of genes is large.

A comparison of the processing time for the data set of Case 1.
For the data set of Case 2, we study the relationship between the processing time and the number of conditions in a microarray data set, and compare the processing time of the UDB method with that of the Co-gclustering method. Figure 10 shows the simulation result for the data set of Case 2. From this figure, we can observe that the UDB method needs shorter processing time than the Co-gclustering method for this data set. In Figure 10, the processing time of the Co-gclustering method seems to increase exponentially as the number of conditions (i.e., N) increases, since its time complexity is O(2
N
). Although the processing time of the Co-gclustering method may be shorter than that of the UDB method when the value of N is small, it increases quickly as the value of N increases. On the other hand, the processing time of the UDB method does not have obvious variations as the number of conditions increases. The reason is that the number of conditions only affects the length (i.e.,

A comparison of the processing time for the data set of Case 2.
For the data set of Case 3, we study the relationship between the processing time and the number of conditions in a found Co-gcluster (i.e., EN), and make a comparison between the Co-gclustering method and the UDB method. For the Co-gclustering method, the parameter of the minimal number of conditions of a Co-gcluster, β, is set to 12. Parameter Sim in the UDB method is set to

A comparison of the processing time for the data set of Case 3.
4.2. Experimental results on real data sets
In this subsection, first, we use real microarray data sets as the experimental data to study the efficiency of the proposed method. Table 4 shows the real microarray data sets used in the experiments. The yeast microarray data set (Tavazoie et al., 1999), obtained from the yeast Saccharomyces Cerevisiae cell cycle expression levels, is widely used in microarray clustering research. The ALL-AML microarray data set (Brunet et al., 2004), often used in microarray classification research, is composed of samples from 27 ALL (acute lymphoblastic leukemia) patients and 11 AML (acute myeloid leukemia) patients. The breast microarray (West et al., 2001) consists of 49 breast cancer samples, obtained from the Duke Breast Cancer SPORE frozen tissue bank.
Table 5 shows the experimental results of these three data sets. From this figure, we could observe that the UDB method needs shorter processing time than the Co-gclustering method for these data sets. The reason is that the UDB method is a polynomial-time method, while the Co-gclustering method is an exponential-time method. The ALL-AML data set consists of more genes and more samples than the yeast data set, and the breast data set consists of more genes and more samples than the ALL-AML data set. Since the time complexity of the UDB method is O(M2) (where M is the number of genes), the processing time of the UDB method for these three data sets increases progressively. However, the processing time of the Co-gclustering method grows exponentially to the number of conditions. Therefore, we could observe that as the number of conditions among these data sets increases, the difference of the processing time between the Co-gclustering method and the UDB method also increases.
Next, we discuss the correlation of genes in clusters found by the UDB method. We use the yeast microarray data set (composed of 2884 rows and 17 columns) as the experimental data set. For the UDB method, when parameters are set to
We then try to find the corresponding Co-gclusters by the Co-gclustering method with equivalent parameters. These equivalent parameters are described as follow. The values of
5. Conclusion
Biclustering of microarray data has been shown to be instrumental in microarray data analysis. In this article, we have proposed the UDB method, which can efficiently find clusters with correlated and negative-correlated patterns. The UDB method utilizes up-down bit patterns to record those condition pairs where one gene is upregulated or downregulated. Then, by applying a heuristic idea on these up-down bit patterns, we can find all clusters in polynomial time. From the simulation results on synthetic microarray data sets, we have studied the factors in the data sets which affect the processing time of the UDB method. Moreover, from the experimental results on synthetic and real microarray data sets, we have shown that the UDB method is more efficient than the Co-gclustering method. Furthermore, we have shown the correlation of genes in clusters found by the UDB method for the yeast microarray data set.
Footnotes
Acknowledgments
This research was supported in part by the National Science Council of Republic of China (Grant NSC-95-2221-E-110-079-MY2) and by the “Aim for Top University Plan” project of NSYSU and the Ministry of Education, Taiwan.
Disclosure Statement
No competing financial interests exist.
