Abstract
Abstract
This article reports a new clustering method based on the k-means algorithm to high-dimensional gene expression data. The proposed approach makes use of bidirectional penalties to constrain the number of clusters and centroids of clusters to simultaneously determine the unknown number of clusters and handle large amounts of noise in gene expression data. Numeric studies indicate that this algorithm not only performs better in clustering but is also comparable to other approaches in its ability to obtain the correct number of clusters and correct signal features. Finally, we apply the proposed approach to analyze two benchmark gene expression datasets. These analyses again indicate that the proposed algorithm performs well in clustering high-dimensional gene expression data with an unknown number of clusters.
1. Introduction
C
Some previous approaches have focused on how to determine the number of clusters heuristically. The process of estimating how well a partition fits the structure underlying the data is known as cluster validation (Luxburg, 2010). The clustering stability criterion (Sun et al., 2012; Witten and Tibshirani, 2012), which measures the robustness of any given clustering algorithm, has been utilized to select the number of clusters through cross-validation (Fang and Wang, 2012). However, an inappropriate choice or application of the criterion can lead to unstable results. This motivated an alternative, the penalized regression-based clustering (PRclust) method, which was proposed for unsupervised learning without prior knowledge regarding the number of clusters (Pan et al., 2013). PRclust performs well to obtain the optimal number of clusters when it is applied to data with a few features, but it generates unstable results while clustering gene expression data.
For extracting useful information in the presence of a high level of background noise, one naïve idea is to do optimal subset selection (Li and Pati, 2016; Anzanello and Fogliatto, 2014), for example, clustering objects on subsets of attributes (COSA) (Friedman and Meulman, 2004). Similar to stepwise regression, these kinds of methods ignore the stochastic error that is inherent in the stages of feature selection, with a different subset of features leading these algorithms to different clustering performance. Because regularization techniques involve imposing certain prior distributions on model parameters to train simple and/or sparse models (Ma and Huang, 2008), they have been widely used in high-dimensional data analysis, especially after the successful application of l1-norm penalty (Tibshirani, 1996, 1997). Regularization techniques were also extended to clustering algorithms to achieve stable results, for example, the sparse clustering algorithm (Witten and Tibshirani, 2010) and the regularized k-means clustering (Sun et al., 2012).
In this article, we introduce a new regularization technique to determine the optimal number of clusters while exploring patterns of gene expression data, and accordingly propose the so-called bidirectional regularizations clustering algorithm (BiRClust) by employing the new regularization technique in a k-means clustering algorithm.
2. Methods
2.1. Notations and motivation
Given a dataset in matrix form
The idea of a regularization technique is to construct a fused lasso penalty to restrain the difference between any two centroids. Without loss of generality, let
where s is a tuning parameter to restrain the scale of centroids. If
2.2. The bidirectional regularizations clustering method
Besides constraining the number of clusters via the binding of the fused lasso, we also restrain centroids of features by the group lasso penalty (Yuan and Lin, 2006) to achieve a sparse model. The proposed clustering algorithm based on the k-means algorithm is defined as
where
To simplify the study, we measure the dissimilarity by Euclidean distance, and we specify the penalty function as the combination of fused lasso penalty and group lasso penalty. Note that the input data
where
3. Algorithm
3.1. Estimating centroids
We introduce the augmented lagrangians and method of multipliers (ADMM) (Bertsekas, 1997) to solve the problem in Equation (3). Let
where
Gradients of
and
respectively.
To simplify the computational process, let
Algorithm 1. Given tuning parameters
and
for
Bidirectional regularizations clustering algorithm
3.2. Selecting tuning parameters
Here, we are interested in models for more than one amount of regularization. Some criteria mentioned in previous studies are proposed to obtain optimal tuning parameters, such as the cross-validation (Kohavi, 1995; Zou, 2006), the generalized cross-validation (Liao and Ng, 2011; Sun et al., 2012), the generalized degrees of freedom (Pan et al., 2013), and the gap statistic (Tibshirani et al., 2001). According to the performance of the gap statistic, in this study, we introduce the genetic algorithm (GA) (Mallipeddi et al., 2011) to solve the maximization problem of the gap statistic and search for optimal tuning parameters. Let
Figure 1 shows the change process of clustering results based on a simulated dataset that contains three clusters that are generated according to case I (Section 4.1.) when tuning parameters are increased. In the beginning, centroids of clusters along all features maintain significant differences from each other when

Fixed
3.3. Setting adaptive weight
The adaptive weight wj can be viewed as an adjustment that can enhance or diminish the impact of the group lasso penalty for each feature. Obviously, noisy features will seriously interfere with clustering progress. We, therefore, assign relative large weights to noisy features, so that noisy features have little or no impact on clustering. According to the configuration in Sun et al. (2012), for each feature, the adaptive weight wj is defined as
where
3.4. Initializing centroids
Distinct from previous studies mentioned by Pan et al. (2013), we do not set the number of clusters in the proposed algorithm equal to the size of observations, because it costs a vast amount of computational time to search for the solution. According to the conclusion in Luxburg (2010), we empirically set the initial number of clusters to
3.5. Active set
Active-set methods are widely applied for solving l1 regularization problems. This method is proposed to optimize a linear least-squares objective function subject to a constraint on the l1 norm of the parameter vector (Schmidt, 2010). The idea of active set is to divide the features into two sets: The working set contains the signal features, and the active set contains the noisy features. In the proposed method, we only update the working set.
4. Results
4.1. Synthetic datasets
The numerical studies are conducted on a Win-7 64-bit platform and by using R software. The simulated dataset
To compare the performance of the proposed approach with others, we first apply the proposed algorithm, the classic k-means algorithm, the sparse k-means algorithm, (Sparcl), and the PRclust clustering algorithm (PRclust) to cluster 100 simulated datasets generated by each case. Then, we assess clustering results via clustering validation, the optimal number of clusters, and feature selection.
Let a set of n elements • • • F-score •
where n is the number of observations; a is the number of pairs of elements in S that are in the same set in X and in the same set in Y; b is the number of pairs of elements in S that are in different sets in X and in different sets in Y; c is the number of pairs of elements in S that are in the same set in X and in different sets in Y; and d is the number of pairs of elements in S that are in different sets in X and in the same set in Y.
Similarly, we compute the average optimal number of clusters for each algorithm based on 100 simulated datasets to show the performance of determining an unknown number of clusters, denoting the average optimal number of clusters as Kopt. Furthermore, times of selected features (SF) and the accuracy of feature selection (AF) are also conducted in simulation studies. Let ã
j
be the indicator variable whether feature j is selected in the simulation, (ã
j
= 1 if feature j is selected, and ã
j
= 0 otherwise); and let aj be the true setting in the simulation, (
Results of simulation studies first reveal that the proposed approach performs as well as or even better than other approaches in clustering. Some clustering validation indices are at the best level. Second, the proposed algorithm has consistently obtained more number of clusters than others, but it is close to the true number of clusters. In addition, the proposed method also performs extremely well in feature selections. It selects almost all signal features and sweeps all noisy features away. However, the PRClust and k-means are unable to recognize which features are noisy despite their ability to cluster high-level noisy data. Even the sparse k-means cannot select features correctly.
4.2. Real datasets
Similar to Sun et al. (2012), we apply the proposed approach to analyze two benchmark microarray datasets: leukemia and lymphoma. The leukemia dataset (Golub et al., 1999) contains data on two types of human acute leukemias: acute myeloid leukemia and acute lymphoblastic leukemia (Herland et al., 2014). Both datasets are provided by Dettling (2004) and available at http://stat.ethz.ch/∼dettling/bagboost.html. These two datasets are summarized in Table 4.
The data are pre-processed according to a previous study (Herland et al., 2014). After preprocessing, the leukemia and lymphoma datasets still have 3571 and 4026 genes, respectively. Similar to the simulation studies, all clustering algorithms are randomly started 100 times to overcome their dependence on the initialization. All clustering results are summarized in Table 5.
In the process of cluster analyzing, we found that the proposed algorithm always selects more genes because of the strong correlation among genes in the data. Let us rewrite the fused lasso penalty as follows:
Mixed-clustering algorithm
Results show that the proposed approach almost always finds the true number of clusters, and its clustering error is very similar to others. For the leukemia dataset, the proposed approach has classified two people into groups that are similar to others. For the lymphoma dataset, the proposed approach has divided all observations into four groups more than the true setting. But two people are allocated to a new cluster in this case, and another two people are clustered into an incorrect group. Obviously, clustering results based on mixed-clust is the combination of the result of sparse k-means and that of BiRClust. Thus, they are not presented in Table 5. We also present the frequency of selected genes based on 100 clustering analysis in Table 6.
Table 6 indicates that only a few genes are selected based on 100 clustering analysis. It indicates that the mixed-clustering algorithm is more efficient than BiRClust in selecting genes, and the BiRClust can be improved by the idea of sparse k-means.
5. Discussion
This article proposed a new clustering algorithm to cluster gene expression data without prior information on the number of clusters. The critical idea is to penalize not only the difference between paired centroids but also the centroids by the feature level. Results of numeric studies indicate that the algorithm performs well in clustering and in determining the number of clusters for a high-dimensional dataset.
However, there is still room to make further improvements. First, the dependence between any genes is still not accounted for (Pan et al., 2013) in our study. This structural information may be very important when we want to obtain more intact and accurate results. Second, different applications are required to measure the similarity between two observations. Sometimes, the data may not have come from a multi-norm distribution. We also need to discuss and present the proposed method's properties and also give the conditions of application. Finally, this algorithm can also be extended and implemented with other loss functions, such as the model-based clustering algorithm.
Footnotes
Acknowledgments
The author is grateful to Prof. Haiyan Huang of the Department of Statistics, University of California, Berkeley, CA, for her helpful discussion and comments; and to Katherine Musliner, PhD, at the National Center for Register-based Research, Aarhus University, for her kind help in revision. This work is supported by “The Young Teachers Development Foundation of Central University of Finance and Economics (QJJ1510),” “The Discipline Construction Foundation of Central University of Finance and Economics,” and “The Fundamental Research Funds for the Central Universities.”
Author Disclosure Statement
No competing financial interests exist.
