Studies on the Clustering Algorithm for Analyzing Gene Expression Data with a Bidirectional Penalty

Abstract

This article reports a new clustering method, based on the k-means algorithm, for high-dimensional gene expression data. The proposed approach uses bidirectional penalties to constrain both the number of clusters and the cluster centroids, so that it simultaneously determines the unknown number of clusters and handles the large amount of noise in gene expression data. Numerical studies indicate that this algorithm not only performs better in clustering but is also comparable to other approaches in its ability to obtain the correct number of clusters and the correct signal features. Finally, we apply the proposed approach to analyze two benchmark gene expression datasets. These analyses again indicate that the proposed algorithm performs well in clustering high-dimensional gene expression data with an unknown number of clusters.

1. Introduction

Clustering analysis, one of the unsupervised machine-learning approaches, has been widely used in many fields (Thalamuthu et al., 2006). The goal of a clustering algorithm is to automatically divide observations into groups, such that similar data objects are within one cluster, whereas dissimilar ones are assigned to different clusters, possibly separating out noise (Assent, 2012). Cluster analysis is typically an important step in data mining. The application of a clustering algorithm to gene expression data often faces two challenging problems (Jiang et al., 2004). The first problem is the uncertainty in the true number of clusters. The second problem is that gene expression data often contain a large amount of noise.

Some previous approaches have focused on how to determine the number of clusters heuristically. The process of estimating how well a partition fits the structure underlying the data is known as cluster validation (Luxburg, 2010). The clustering stability criterion (Sun et al., 2012; Witten and Tibshirani, 2012), which measures the robustness of any given clustering algorithm, has been utilized to select the number of clusters through cross-validation (Fang and Wang, 2012). However, an inappropriate choice or application of the criterion can lead to unstable results. This motivated an alternative, the penalized regression-based clustering (PRclust) method, which was proposed for unsupervised learning without prior knowledge regarding the number of clusters (Pan et al., 2013). PRclust performs well in obtaining the optimal number of clusters when it is applied to data with a few features, but it generates unstable results when clustering gene expression data.

To extract useful information in the presence of a high level of background noise, one naïve idea is to perform optimal subset selection (Li and Pati, 2016; Anzanello and Fogliatto, 2014), for example, clustering objects on subsets of attributes (COSA) (Friedman and Meulman, 2004). Similar to stepwise regression, these kinds of methods ignore the stochastic error that is inherent in the stages of feature selection, so that different subsets of features lead these algorithms to different clustering performance. Because regularization techniques involve imposing certain prior distributions on model parameters to train simple and/or sparse models (Ma and Huang, 2008), they have been widely used in high-dimensional data analysis, especially after the successful application of the l₁-norm penalty (Tibshirani, 1996, 1997). Regularization techniques were also extended to clustering algorithms to achieve stable results, for example, the sparse clustering algorithm (Witten and Tibshirani, 2010) and the regularized k-means clustering (Sun et al., 2012).

In this article, we introduce a new regularization technique to determine the optimal number of clusters while exploring patterns of gene expression data, and accordingly propose the so-called bidirectional regularizations clustering algorithm (BiRClust) by employing the new regularization technique in a k-means clustering algorithm.

2. Methods

2.1. Notations and motivation

Consider a dataset in matrix form $\mathbf{X} = (X_{1\cdot}, X_{2\cdot}, \ldots, X_{n\cdot})^{\mathrm{T}}$ that has $n$ observations, where $X_{i\cdot} = (x_{i1}, x_{i2}, \ldots, x_{ip})^{\mathrm{T}}$ is the $i$th observation. Dually, $\mathbf{X}$ can also be represented as the matrix of features $(X_{\cdot 1}, X_{\cdot 2}, \ldots, X_{\cdot p})$, where $X_{\cdot j} = (x_{1j}, x_{2j}, \ldots, x_{nj})^{\mathrm{T}}$ is the $j$th feature. Assume that observations are divided into $K$ clusters, and $C_k$ is the set of observations in the $k$th cluster, which contains $n_k$ observations. The corresponding clustering centroid of the $k$th cluster is $U_{k\cdot} = (u_{k1}, u_{k2}, \ldots, u_{kp})^{\mathrm{T}}$ for $k = 1, 2, \ldots, K$. Let $d(X_{k\cdot}, X_{l\cdot}) \in [0, +\infty)$ be the dissimilarity measure between two points $X_{k\cdot}$ and $X_{l\cdot}$. We introduce the total sum of dissimilarity $\sum_{k=1}^{K} \sum_{i \in C_k} d(X_{i\cdot}, U_{k\cdot})$ as a criterion to divide observations into $K$ clusters. Given the number of clusters $K$, an optimal centroid set $\mathbf{U} = \{U_{k\cdot}^{\mathrm{opt}} \mid k = 1, 2, \ldots, K\}$ that minimizes the criterion can be obtained. A different number of clusters $K$ yields different clustering centroids. Hence, to achieve optimal cluster validation, it is necessary to try many different values of $K$ to obtain the global optimal solution.
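As a small illustration of the criterion above, the following Python sketch evaluates the total sum of dissimilarity for a fixed partition. It is not the authors' implementation: the function names, the toy data, the choice of squared Euclidean distance for $d$, and the use of cluster means as centroids are all assumptions made only for this example.

```python
import numpy as np


def sq_euclidean(x, u):
    """Assumed dissimilarity d(x, u): squared Euclidean distance."""
    return float(np.sum((x - u) ** 2))


def total_dissimilarity(X, labels, U, d=sq_euclidean):
    """Sum over clusters k and observations i in C_k of d(X_i., U_k.)."""
    return sum(d(X[i], U[labels[i]]) for i in range(X.shape[0]))


def cluster_means(X, labels, K):
    """Hypothetical helper: centroid of each cluster taken as its member mean."""
    return np.vstack([X[labels == k].mean(axis=0) for k in range(K)])


# Toy data (n = 4 observations, p = 2 features), invented for illustration.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

labels_k2 = np.array([0, 0, 1, 1])   # partition into K = 2 clusters
labels_k1 = np.zeros(4, dtype=int)   # everything in K = 1 cluster

crit_k2 = total_dissimilarity(X, labels_k2, cluster_means(X, labels_k2, 2))
crit_k1 = total_dissimilarity(X, labels_k1, cluster_means(X, labels_k1, 1))
print(crit_k2, crit_k1)  # 4.0 104.0
```

In this toy example the criterion changes from 104 to 4 when the number of clusters goes from 1 to 2, which shows that the value of the criterion, and the centroids that attain it, depend on the chosen number of clusters.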

The idea of a regularization technique is to construct a fused lasso penalty to restrain the difference between any two centroids. Without loss of generality, let $h(\|U_{k\cdot} - U_{l\cdot}\|_1)$ ($k, l = 1, 2, \ldots, K$; $k \ne l$) be the fused lasso penalty (Tibshirani et al., 2005). The unsupervised learning algorithm is defined as

\begin{align*}
&\mathop{\arg\min}_{\mathbf{U}} \left\{ \sum_{k=1}^{K} \sum_{X_{i\cdot} \in C_k} d(X_{i\cdot}, U_{k\cdot}) \right\} \\
&\text{s.t.} \ \sum_{k<l} h\left(\|U_{k\cdot} - U_{l\cdot}\|_1\right) \le s, \tag{1}
\end{align*}

where $s$ is a tuning parameter to restrain the scale of centroids. If $K = n$, criterion (1) becomes the approach mentioned in Pan et al. (2013). Meanwhile, if $d(X_{i\cdot}, U_{k\cdot})$ is set as the Euclidean distance, $K < n$, and $s \to \infty$, criterion (1) is the k-means algorithm. Although criterion (1) can divide gene expression data into groups, the curse of dimensionality will lead the algorithm to yield unstable and inaccurate results.
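To make the constraint in criterion (1) concrete, the sketch below computes the left-hand side of the fused lasso constraint for a given centroid matrix. The function name `fused_penalty`, the toy centroids, and the choice of the identity function for $h$ are assumptions for illustration only; the text leaves $h$ general.

```python
from itertools import combinations

import numpy as np


def fused_penalty(U, h=lambda t: t):
    """Sum over pairs k < l of h(||U_k. - U_l.||_1).

    h defaults to the identity, which is an assumption made for this sketch.
    """
    K = U.shape[0]
    return float(sum(h(np.abs(U[k] - U[l]).sum())
                     for k, l in combinations(range(K), 2)))


# Three centroids (rows), two of which coincide; values are invented.
U = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [3.0, 4.0]])

print(fused_penalty(U))  # 14.0
```

The pair of identical centroids contributes zero to the sum, while each of the two remaining pairs contributes $|3| + |4| = 7$. A small bound $s$ therefore favors solutions in which some centroids coincide, which is how the constraint can reduce the effective number of clusters.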

2.2. The bidirectional regularizations clustering method

Besides constraining the number of clusters via the binding of the fused lasso, we also restrain centroids of features by the group lasso penalty (Yuan and Lin, 2006) to achieve a sparse model. The proposed clustering algorithm based on the k-means algorithm is defined as

\begin{align*}
&\mathop{\arg\min}_{\mathbf{U}} \left\{ \sum_{k=1}^{K} \sum_{X_{i\cdot} \in C_k} d(X_{i\cdot}, U_{k\cdot}) \right\} \\
&\text{s.t.} \ \left\{ \begin{matrix} \sum_{k<l} h\left(\|U_{k\cdot} - U_{l\cdot}\|_1\right) \le s_1 \\ \sum_{j=1}^{p} \eta\left(\|U_{\cdot j}\|_2\right) \le s_2 \end{matrix} \right. , \tag{2}
\end{align*}

where $s_1$ and $s_2$ are tuning parameters. Let $\widehat{\mathbf{U}}$ be the solution of function (2). $\widehat{\mathbf{U}}$ can be represented as $\widehat{\mathbf{U}} = \{\widehat{U}_{k\cdot} \mid k = 1, 2, \ldots, K'\}$ or $\widehat{\mathbf{U}} = \{\widehat{U}_{\cdot j} \mid j = 1, 2, \ldots, p'\}$, which is a $K' \times p'$ matrix. Obviously, the role of these two penalties is to restrain the number of centroids in two directions, and it is the so-called bidirectional penalty.
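The two directions of the penalty can be illustrated with a short Python sketch that evaluates the group lasso term column by column and then counts the distinct centroids (rows) and the non-zero features (columns) of a centroid matrix. The names `group_penalty` and `effective_shape`, the tolerance, the toy matrix, and the identity choice for $\eta$ are hypothetical and serve only this example.

```python
import numpy as np


def group_penalty(U, eta=lambda t: t):
    """Sum over features j of eta(||U_.j||_2); eta = identity is an assumption."""
    return float(sum(eta(np.linalg.norm(U[:, j])) for j in range(U.shape[1])))


def effective_shape(U, decimals=8):
    """Return (K', p'): number of distinct rows and number of non-zero columns."""
    rounded = np.round(U, decimals)
    k_prime = np.unique(rounded, axis=0).shape[0]          # distinct centroids
    p_prime = int(np.sum(np.linalg.norm(rounded, axis=0) > 0))  # active features
    return k_prime, p_prime


# Invented centroid matrix with K = 3 rows and p = 3 columns:
# rows 0 and 1 coincide (fused), and column 1 is entirely zero (sparse).
U_hat = np.array([[1.0, 0.0, 2.0],
                  [1.0, 0.0, 2.0],
                  [-2.0, 0.0, -4.0]])

print(group_penalty(U_hat))    # sqrt(6) + sqrt(24)
print(effective_shape(U_hat))  # (2, 2)
```

Here the fused rows reduce the three nominal clusters to two distinct centroids, and the zero column removes one feature, so the matrix is effectively $2 \times 2$.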

To simplify the study, we measure the dissimilarity by Euclidean distance, and we specify the penalty function as the combination of fused lasso penalty and group lasso penalty. Note that the input data $\mathbf{X}$ is standardized with zero mean ($\frac{1}{n} \sum_{i=1}^{n} x_{ij} = 0$) along each feature before clustering. According to the Karush–Kuhn–Tucker (KKT) condition (Boyd et al., 2006), centroids $\widehat{\mathbf{U}}$ can be optimized by

\begin{align*}
\widehat{\mathbf{U}} = \mathop{\arg\min}_{\mathbf{U}} \left\{ \frac{1}{2} \sum_{k=1}^{K} \sum_{X_{i\cdot} \in C_k} \|X_{i\cdot} - U_{k\cdot}\|^2 + \lambda_1 \sum_{k<l} h\left(\|U_{k\cdot} - U_{l\cdot}\|_1\right) + \lambda_3 \sum_{j=1}^{p} w_j \|U_{\cdot j}\|_2 \right\} \tag{3}
\end{align*}
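The penalized objective in Equation (3) can be evaluated directly for a given partition and centroid matrix. The sketch below does so under stated assumptions: $h$ is taken as the identity, the weights $w_j$ default to 1 (the text does not specify them at this point), and the function names and toy data are invented for illustration. It only evaluates the objective; it does not minimize it.

```python
from itertools import combinations

import numpy as np


def center_features(X):
    """Subtract the column means so that each feature has zero mean."""
    return X - X.mean(axis=0)


def penalized_objective(X, labels, U, lam1, lam3, w=None):
    """Value of the objective in Equation (3), with h assumed to be the identity."""
    p = X.shape[1]
    w = np.ones(p) if w is None else np.asarray(w, dtype=float)  # assumed weights
    fit = 0.5 * np.sum((X - U[labels]) ** 2)                      # data-fit term
    fused = sum(np.abs(U[k] - U[l]).sum()
                for k, l in combinations(range(U.shape[0]), 2))   # fused lasso
    group = np.sum(w * np.linalg.norm(U, axis=0))                 # group lasso
    return float(fit + lam1 * fused + lam3 * group)


# Toy data: 4 observations, 2 features; the second feature is constant.
X = center_features(np.array([[1.0, 5.0], [3.0, 5.0], [7.0, 5.0], [9.0, 5.0]]))
labels = np.array([0, 0, 1, 1])
U = np.array([[-3.0, 0.0], [3.0, 0.0]])   # cluster means of the centered data

obj = penalized_objective(X, labels, U, lam1=0.5, lam3=1.0)
print(obj)  # 2 + 0.5 * 6 + sqrt(18)
```

With both tuning parameters set to zero the function returns only the data-fit term, which is 2 for this toy example. The constant second feature contributes nothing to the group lasso term because its centroid column is zero.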

3. Algorithm

3.1. Estimating centroids

We introduce the augmented Lagrangian and method of multipliers (ADMM) (Bertsekas, 1997) to solve the problem in Equation (3). Let $\theta_{(kl)\cdot} = U_{k\cdot} - U_{l\cdot}$. The term $\sum_{k<l} \|U_{k\cdot} - U_{l\cdot}\|_1$ then incorporates an additional quadratic penalty term $\frac{\lambda_1}{2} \sum_{k \ne l} \|U_{k\cdot} - U_{l\cdot} - \theta_{(kl)\cdot}\|_2^2 + \frac{\lambda_2}{2} \sum_{l \ne k} \|\theta_{(kl)\cdot}\|_1$. Function (3) may, therefore, be rewritten as

\begin{align*}
\left\{ \widehat{\mathbf{U}}; \widehat{\boldsymbol{\Theta}} \right\} = \mathop{\arg\min}_{\mathbf{U}; \boldsymbol{\Theta}} \left\{ \frac{1}{2} \sum_{k=1}^{K} \sum_{X_{i\cdot} \in C_k} \|X_{i\cdot} - U_{k\cdot}\|_2^2 + \lambda_1 \sum_{k \ne l} \|U_{k\cdot} - U_{l\cdot} - \theta_{(kl)\cdot}\|_2^2 + \lambda_2 \sum_{k \ne l} \|\theta_{(kl)\cdot}\|_1 + \lambda_3 \sum_{j=1}^{p} w_j \|U_{\cdot j}\|_2 \right\} \tag{4}
\end{align*}

Gradients of $f(\mathbf{U}; \boldsymbol{\Theta})$ and $g(\mathbf{U}; \boldsymbol{\Theta})$ in the $u_{kj}$ and $\theta_{(kl)j}$ directions are

\begin{align*}
\begin{cases}
\dfrac{\partial f(\mathbf{U}; \boldsymbol{\Theta})}{\partial u_{kj}} &= -\displaystyle\sum_{X_{i\cdot} \in C_k} x_{ij} + n_k u_{kj} \\
\dfrac{\partial g(\mathbf{U}; \boldsymbol{\Theta})}{\partial u_{kj}} &= \lambda_1 \left\{ (K - 1) u_{kj} - \displaystyle\sum_{k > l} \left( u_{kj} + \theta_{(kl)j} \right) - \displaystyle\sum_{k < l} \left( u_{kj} - \theta_{(kl)j} \right) \right\} + \lambda_3 \dfrac{w_j u_{kj}}{\sqrt{\sum_{k=1}^{K} u_{kj}^2}}
\end{cases}
\end{align*}

and

\begin{align*}
\begin{cases}
\dfrac{\partial f(\mathbf{U}; \boldsymbol{\Theta})}{\partial \theta_{(kl)j}} &= 0 \\
\dfrac{\partial g(\mathbf{U}; \boldsymbol{\Theta})}{\partial \theta_{(kl)j}} &= \lambda_1 \theta_{(kl)j} - \lambda_1 \left( u_{kj} - u_{lj} \right) + \lambda_2 \operatorname{sign}\left( \theta_{(kl)j} \right)
\end{cases}
\end{align*}

respectively.
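The stated gradient of $f$ with respect to $u_{kj}$ can be checked numerically. The sketch below assumes that $f$ denotes the data-fit term $\frac{1}{2} \sum_{k} \sum_{X_{i\cdot} \in C_k} \|X_{i\cdot} - U_{k\cdot}\|_2^2$, which is what the gradient formula implies but is not stated explicitly in the text. The function names and the random test data are invented for this example.

```python
import numpy as np


def f_fit(X, labels, U):
    """Assumed definition of f: half the within-cluster sum of squares."""
    return 0.5 * float(np.sum((X - U[labels]) ** 2))


def grad_f(X, labels, U):
    """Analytic gradient: df/du_kj = -sum_{i in C_k} x_ij + n_k * u_kj."""
    G = np.zeros_like(U)
    for k in range(U.shape[0]):
        members = X[labels == k]
        G[k] = -members.sum(axis=0) + members.shape[0] * U[k]
    return G


def numeric_grad_f(X, labels, U, eps=1e-6):
    """Central finite differences, one coordinate u_kj at a time."""
    G = np.zeros_like(U)
    for k in range(U.shape[0]):
        for j in range(U.shape[1]):
            up, down = U.copy(), U.copy()
            up[k, j] += eps
            down[k, j] -= eps
            G[k, j] = (f_fit(X, labels, up) - f_fit(X, labels, down)) / (2 * eps)
    return G


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
labels = np.array([0, 1, 2] * 4)   # every cluster has 4 members
U = rng.normal(size=(3, 3))

max_gap = float(np.max(np.abs(grad_f(X, labels, U) - numeric_grad_f(X, labels, U))))
print(max_gap < 1e-5)  # True

# Setting the gradient of f alone to zero gives the cluster means (no penalties).
means = np.vstack([X[labels == k].mean(axis=0) for k in range(3)])
print(np.allclose(grad_f(X, labels, means), 0.0))  # True
```

The second check reflects the unpenalized case: when only $f$ is minimized, the stationarity condition returns the ordinary k-means centroid update.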

To simplify the computational process, let $a_{kj} = \sum_{k > l} \left( u_{kj} + \theta_{(kl)j} \right) + \sum_{k < l} \left( u_{kj} - \theta_{(kl)j} \right)$, $b_{kj} = \sum_{X_{i\cdot} \in C_k} x_{ij}$, and $c_{kj} = \left( n_k + \lambda_1 (K - 1) \right) + \lambda_3 \frac{w_j}{\sqrt{\sum_{k=1}^{K} u_{kj}^2}}$. According to Pan et al. (2013) and Teng and Huang (2009), the coordinate-wise descent algorithm is utilized to estimate the unknown parameters. $\widehat{u}_{kj}^{(s+1)}$ is estimated by the following algorithm:

Algorithm 1. Given tuning parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$, according to the KKT condition, $\widehat{u}_{kj}^{(s+1)}$ and $\widehat{\theta}_{(kl)j}^{(s+1)}$ are updated, respectively, by

\begin{align*}
\widehat{u}_{kj}^{(s+1)} = \begin{cases} \dfrac{b_{kj}^{(s)} + \lambda_1 a_{kj}^{(s)}}{c_{kj}^{(s)}}, & \text{if } \left( \displaystyle\sum_{k=1}^{K} \left( b_{kj}^{(s)} + \lambda_1 a_{kj}^{(s)} \right)^2 \right)^{\frac{1}{2}} > \lambda_3 w_j \\ 0, & \text{otherwise} \end{cases} \tag{5}
\end{align*}

and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\widehat {\theta}}_{ \left( {kl} \right) j}^{ \left( {s + 1} \right) } = \left\{ \begin{matrix} { \ } \\ {S \left( { {\widehat {u}}_{kj}^{ ({s + 1})} - {\widehat {u}}_{lj}^{( {s + 1} )} , {{{ \lambda _2}} \over {{ \lambda _1}}}} \right) , \ if \ \left\vert { {\widehat {u}}_{kj}^{({s + 1})} - {\widehat {u}}_{lj}^{ ( {s + 1}) }} \right\vert > {{{ \lambda _2}} \over {{ \lambda _1}}}} \\ {0 , \ \quad\quad\quad\quad\quad\quad\quad {\rm otherwise}}\end{matrix} \right. \tag {6} \end{align*} \end{document}

for $k, l = 1, 2, \ldots, K$ and $k \ne l$; $j = 1, 2, \ldots, p$. Here, $S(a, b) = \mathrm{sign}(a)\left( |a| - b \right)_{+}$ is the soft-thresholding rule (Donoho and Johnstone, 1994), and $(x)_{+}$ takes the positive part of $x$: $(x)_{+} = x$ if $x > 0$; $(x)_{+} = 0$ otherwise.
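The soft-thresholding rule and the update in Equation (6) can be illustrated with a minimal Python sketch. The function names `soft_threshold` and `update_theta` are hypothetical and chosen only for this illustration; the original study reports using R, so this is not the authors' implementation.

```python
import numpy as np

def soft_threshold(a, b):
    """S(a, b) = sign(a) * (|a| - b)_+, applied elementwise."""
    a = np.asarray(a, dtype=float)
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def update_theta(u_k, u_l, lam1, lam2):
    """Equation (6) for one pair of centroids (k, l) across all features.

    u_k and u_l are length-p centroid vectors (hypothetical inputs).
    """
    diff = np.asarray(u_k, dtype=float) - np.asarray(u_l, dtype=float)
    return soft_threshold(diff, lam2 / lam1)
```

Because the positive part is zero whenever the absolute difference does not exceed $\lambda_2 / \lambda_1$, the "otherwise" branch of Equation (6) is already covered by the soft-thresholding rule itself in this sketch.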

Remark: According to Algorithm 1, the centroids of feature j in different clusters are updated if and only if $\left( \sum_{k=1}^{K} \left( b_{kj}^{(s)} + \lambda_1 a_{kj}^{(s)} \right)^2 \right)^{1/2} > \lambda_3 w_j$; otherwise, the feature is taken as a noisy feature and swept from clustering. Similarly, the difference $\widehat{\theta}_{(kl)j}$ between any two centroids along feature j is updated if and only if $\left| \widehat{u}_{kj}^{(s+1)} - \widehat{u}_{lj}^{(s+1)} \right| > \frac{\lambda_2}{\lambda_1}$; otherwise, it is set to zero. These two processes enable the proposed algorithm to select features and determine the number of clusters from high-dimensional data.

Algorithm 1: Bidirectional regularizations clustering algorithm

Input: the dataset of observations $\mathbf{X}$, the set of initial centroids, and the fixed tuning parameters ($\lambda_1$, $\lambda_2$, and $\lambda_3$)

Output: Clustering results

Process:

Initialize the parameters.

Begin iteration until convergence, for $s = 0, 1, 2, \ldots$

For each cluster and each feature ($k = 1, 2, \ldots, K$; $j = 1, 2, \ldots, p$):

(a) Compute $a_{kj}^{(s)}$, $b_{kj}^{(s)}$, and $c_{kj}^{(s)}$.

(b) If $\left( \sum_{k=1}^{K} \left( b_{kj}^{(s)} + \lambda_1 a_{kj}^{(s)} \right)^2 \right)^{1/2} > \lambda_3 w_j$, then update $\widehat{u}_{kj}^{(s+1)}$ according to Equation (5); otherwise, set $\widehat{u}_{kj}^{(s+1)} = 0$. If $\left| \widehat{u}_{kj}^{(s+1)} - \widehat{u}_{lj}^{(s+1)} \right| > \frac{\lambda_2}{\lambda_1}$, then update $\widehat{\theta}_{(kl)j}^{(s+1)} = S\left( \widehat{u}_{kj}^{(s+1)} - \widehat{u}_{lj}^{(s+1)}, \frac{\lambda_2}{\lambda_1} \right)$; otherwise, set $\widehat{\theta}_{(kl)j}^{(s+1)} = 0$.

Finally, compute the convergence criterion:

(a) $\delta^{(s+1)} = \dfrac{\sum_{k=1}^{K} \sum_{j=1}^{p} \left| \widehat{u}_{kj}^{(s+1)} - \widehat{u}_{kj}^{(s)} \right|}{\sum_{k=1}^{K} \sum_{j=1}^{p} \left| \widehat{u}_{kj}^{(s)} \right|}$;

(b) If $\delta^{(s+1)}$ is sufficiently small, stop the iteration. Otherwise, proceed to the next iteration and repeat.
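As a rough illustration of one pass of the centroid update in Equation (5) and of the convergence criterion, the following Python sketch may help. It takes the quantities $a_{kj}^{(s)}$, $b_{kj}^{(s)}$, and $c_{kj}^{(s)}$ as given K × p arrays, because their definitions precede this section; the function names are hypothetical, and the entries of `c` are assumed to be strictly positive.

```python
import numpy as np

def update_centroids(a, b, c, w, lam1, lam3):
    """Equation (5): a, b, c are K x p arrays; w holds the p adaptive weights."""
    numer = b + lam1 * a                             # K x p numerators
    group_norm = np.sqrt((numer ** 2).sum(axis=0))   # one norm per feature j
    keep = group_norm > lam3 * w                     # features that survive
    # Features failing the group threshold get a zero centroid in every cluster.
    return np.where(keep, numer / c, 0.0)

def relative_change(u_new, u_old):
    """Convergence criterion: summed absolute change relative to the old centroids."""
    return np.abs(u_new - u_old).sum() / np.abs(u_old).sum()
```

The comparison is made once per feature, so a feature is either kept for all clusters or zeroed for all clusters, which matches the group-wise condition in the Remark.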

3.2. Selecting tuning parameters

Here, we are interested in models for more than one amount of regularization. Several criteria have been proposed in previous studies to obtain optimal tuning parameters, such as cross-validation (Kohavi, 1995; Zou, 2006), generalized cross-validation (Liao and Ng, 2011; Sun et al., 2012), generalized degrees of freedom (Pan et al., 2013), and the gap statistic (Tibshirani et al., 2001). In view of the performance of the gap statistic, in this study we introduce the genetic algorithm (GA) (Mallipeddi et al., 2011) to solve the maximization problem of the gap statistic and search for optimal tuning parameters. Let $\mathbf{X}$ be the dataset that contains n observations, $\mathbf{X}_b$ $(b = 1, 2, \ldots, B)$ be the permuted dataset in which each feature of $\mathbf{X}$ is independently and randomly permuted, and $l(\mathbf{X} \mid \lambda_1, \lambda_2, \lambda_3)$ be the loss function (e.g., the k-means in this study) for the given tuning parameters. The gap statistic is then defined as

$$\mathrm{Gap}\left( \lambda_1, \lambda_2^{\prime} \right) = \log\left( l\left( \mathbf{X} \mid \lambda_1, \lambda_2^{\prime} \right) \right) - \frac{1}{B} \sum_{b=1}^{B} \log\left( l\left( \mathbf{X}_b \mid \lambda_1, \lambda_2^{\prime} \right) \right). \tag{7}$$
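The permutation scheme behind Equation (7) can be sketched in Python as follows. This is only an illustration under stated assumptions: `loss` is a hypothetical callable that returns a positive loss value for a dataset at fixed tuning parameters, and the function names are not taken from the original study.

```python
import numpy as np

def permute_features(X, rng):
    """Return a copy of X in which every column is shuffled independently."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def gap_statistic(X, loss, B=10, seed=0):
    """log-loss on X minus the mean log-loss over B permuted datasets."""
    rng = np.random.default_rng(seed)
    observed = np.log(loss(X))
    permuted = [np.log(loss(permute_features(X, rng))) for _ in range(B)]
    return observed - np.mean(permuted)
```

In a tuning-parameter search, `gap_statistic` would be evaluated once per candidate setting, with `loss` closing over that setting; a genetic algorithm or any other optimizer could then maximize the returned value.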

Figure 1 shows how the clustering results change as the tuning parameters increase, based on a simulated dataset that contains three clusters generated according to case I (Section 4.1). In the beginning, when $\lambda_2$ is set at a small value, the centroids of clusters along all features remain clearly different from each other. As $\lambda_2$ increases, some centroids of noise features in different clusters begin to move closer together. When $\lambda_2$ reaches a relatively large value, some centroids of noise features in different clusters take the same value, and these features are removed from clustering according to their contribution. Similarly, the proposed algorithm makes the centroids of two clusters converge as $\lambda_2$ increases. We can, therefore, introduce a GA to find optimal tuning parameters.

FIG. 1.

Fixed $\lambda_1$, $\lambda_2 = 0.5$, and $\lambda_3 = 8$, solution paths of centroids in different clusters for the proposed approach based on a simulated dataset that contains three clusters generated according to case I.

3.3. Setting adaptive weight

The adaptive weight $w_j$ can be viewed as an adjustment that enhances or diminishes the impact of the group lasso penalty for each feature. Noisy features can seriously interfere with the clustering process. We, therefore, assign relatively large weights to noisy features, so that they have little or no impact on clustering. Following the configuration in Sun et al. (2012), the adaptive weight $w_j$ for each feature is defined as

$$w_j = \frac{1}{\left\| \widehat{U}_{\cdot j}^{k\text{-means}} \right\|_2^2}$$
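A minimal Python sketch of this weight is given here, assuming that the k-means centroids are already available as a K × p array; the function name `adaptive_weights` is hypothetical. A feature whose centroids all sit near zero, as would be expected for a centered noisy feature, receives a large weight.

```python
import numpy as np

def adaptive_weights(centroids):
    """w_j = 1 / ||U_{.j}||_2^2 for a K x p matrix of k-means centroids."""
    U = np.asarray(centroids, dtype=float)
    # Squared Euclidean norm of each column (feature) across the K clusters.
    return 1.0 / (U ** 2).sum(axis=0)
```

For example, with centroids `[[1.0, 0.1], [2.0, -0.1]]`, the first feature separates the clusters and gets weight 0.2, whereas the second is almost flat and gets a weight of about 50. A column of exact zeros would produce a division by zero, so a practical implementation would need a guard for that case.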

3.4. Initializing centroids

Distinct from previous studies mentioned by Pan et al. (2013), we do not set the number of clusters in the proposed algorithm equal to the number of observations, because searching for the solution would then cost a vast amount of computational time. According to the conclusion in Luxburg (2010), we empirically set the initial number of clusters to $\sqrt{n}$, where n is the number of observations.

3.5. Active set

Active-set methods are widely applied for solving l₁ regularization problems. Such a method was proposed to optimize a linear least-squares objective function subject to a constraint on the l₁ norm of the parameter vector (Schmidt, 2010). The idea of the active set is to divide the features into two sets: the working set contains the signal features, and the active set contains the noisy features. In the proposed method, we only update the working set.

4. Results

4.1. Synthetic datasets

The numerical studies are conducted on a Win-7 64-bit platform using R software. The simulated dataset $\mathbf{X}$ includes n observations with p features belonging to K clusters. For each cluster, only signal features contribute information to the cluster; other features are noisy features that provide nothing but disturbance.

Case I. Scenario (a). This case contains four clusters, and each cluster has 40 observations. Eight signal features are used to generate these clusters, and 42 other features are used to generate noisy information. The simulated data of the ith cluster are sampled randomly and independently from the multivariate normal distribution $\mathrm{N}\left( \mu_i, \mathbf{\Sigma}_i \right)$. The centroids $\mu_i$ are re-sampled from the sequence (−2, −1, 0, 1, 2), and $\mathbf{\Sigma}_i$ is an $8 \times 8$ covariance matrix with 1 at the diagonal entries and 0.8 otherwise. Noisy features are generated randomly and independently from a univariate normal distribution $\mathrm{N}(0, 0.2)$. Scenario (b). All settings are the same as in Scenario (a), except the number of features. This case contains 15 signal features and 985 noisy features.
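The design of Scenario (a) can be sketched in Python as follows. This is an illustration rather than the simulation code of the study (which used R). Two points are assumptions: the second argument of N(0, 0.2) is treated as a standard deviation, and each coordinate of a cluster centroid is drawn with replacement from (−2, −1, 0, 1, 2). The function name and its arguments are hypothetical.

```python
import numpy as np

def simulate_case_one(n_clusters=4, n_per_cluster=40, n_signal=8, n_noise=42,
                      rho=0.8, noise_scale=0.2, seed=0):
    """Generate one dataset: signal features per cluster plus shared noise features."""
    rng = np.random.default_rng(seed)
    # Covariance with 1 on the diagonal and rho elsewhere.
    cov = np.full((n_signal, n_signal), rho)
    np.fill_diagonal(cov, 1.0)
    blocks, labels = [], []
    for i in range(n_clusters):
        # Assumption: centroid coordinates are drawn with replacement.
        mu = rng.choice([-2, -1, 0, 1, 2], size=n_signal)
        signal = rng.multivariate_normal(mu, cov, size=n_per_cluster)
        # Assumption: noise_scale is a standard deviation, not a variance.
        noise = rng.normal(0.0, noise_scale, size=(n_per_cluster, n_noise))
        blocks.append(np.hstack([signal, noise]))
        labels += [i] * n_per_cluster
    return np.vstack(blocks), np.array(labels)
```

Calling `simulate_case_one()` yields a 160 × 50 data matrix with the true labels; passing `n_signal=15` and `n_noise=985` gives the dimensions of Scenario (b).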

Case II. This case contains the synthetic datasets generated in case I, together with 20 outliers generated from a univariate normal distribution $\mathrm{N}(0, 0.2)$.

Case III. All settings are the same as the two scenarios in case I, except the covariance matrix $\mathbf{\Sigma}_i$. Each diagonal element of the covariance matrix $\mathbf{\Sigma}_i$ in this case is re-sampled from the sequence of (1, 2, 3).

To compare the performance of the proposed approach with others, we first apply the proposed algorithm, the classic k-means algorithm, the sparse k-means algorithm (Sparcl), and the PRclust clustering algorithm (PRclust) to cluster 100 simulated datasets generated under each case. Then, we assess the clustering results via clustering validation, the optimal number of clusters, and feature selection.

Consider a set of n elements $S = \left\{ o_1, o_2, \ldots, o_n \right\}$ and two partitions of S to compare: $\boldsymbol{C} = \left\{ c_1, c_2, \ldots, c_r \right\}$, the true partition of S into r subsets, and $\mathbf{\Omega} = \left\{ \omega_1, \omega_2, \ldots, \omega_q \right\}$, the estimated partition of S into q subsets. The Rand index, Jaccard index, F-score, and purity (Liu et al., 2010) are applied to evaluate the performance of clustering algorithms in this study. The formulas to compute these indices are as follows:

• $\mathrm{Rand} = (a + d) / M$, $M = n(n - 1) / 2$,

• F-score $= 2 \left( \mathrm{precision} \cdot \mathrm{recall} \right) / \left( \mathrm{precision} + \mathrm{recall} \right)$,

where n is the number of observations; a is the number of pairs of elements in S that are in the same set in $\boldsymbol{C}$ and in the same set in $\mathbf{\Omega}$; b is the number of pairs of elements in S that are in different sets in $\boldsymbol{C}$ and in the same set in $\mathbf{\Omega}$; c is the number of pairs of elements in S that are in the same set in $\boldsymbol{C}$ and in different sets in $\mathbf{\Omega}$; and d is the number of pairs of elements in S that are in different sets in $\boldsymbol{C}$ and in different sets in $\mathbf{\Omega}$.
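The pair-counting indices can be illustrated with a short Python sketch. The text does not define precision and recall, so the pairwise definitions used here are an assumption: precision is the share of co-clustered pairs in the estimated partition that are also co-clustered in the true partition, and recall is the share of truly co-clustered pairs that the estimate recovers. Function and variable names are hypothetical.

```python
from itertools import combinations

def pair_counts(truth, estimate):
    """Count element pairs by whether they share a cluster in each partition."""
    same_both = same_truth_only = same_est_only = diff_both = 0
    for i, j in combinations(range(len(truth)), 2):
        in_truth = truth[i] == truth[j]
        in_est = estimate[i] == estimate[j]
        if in_truth and in_est:
            same_both += 1
        elif in_truth:
            same_truth_only += 1
        elif in_est:
            same_est_only += 1
        else:
            diff_both += 1
    return same_both, same_truth_only, same_est_only, diff_both

def rand_index(truth, estimate):
    """Share of pairs on which the two partitions agree."""
    ss, st, se, dd = pair_counts(truth, estimate)
    return (ss + dd) / (ss + st + se + dd)

def f_score(truth, estimate):
    """Harmonic mean of pairwise precision and recall (assumed definitions)."""
    ss, st, se, _ = pair_counts(truth, estimate)
    precision = ss / (ss + se)
    recall = ss / (ss + st)
    return 2 * precision * recall / (precision + recall)
```

For the true labels `[0, 0, 1, 1]` and the estimated labels `[0, 0, 1, 2]`, five of the six pairs agree, so the Rand index is 5/6, and the F-score is 2/3 because precision is 1 and recall is 1/2.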

Similarly, we compute the average optimal number of clusters for each algorithm over the 100 simulated datasets to assess how well it determines an unknown number of clusters, and we denote this average as Kopt. Furthermore, the number of selected features (SF) and the accuracy of feature selection (AF) are also evaluated in the simulation studies. Let ã_j be the indicator of whether feature j is selected in the simulation (ã_j = 1 if feature j is selected, and ã_j = 0 otherwise), and let a_j be the true setting in the simulation ($a_j = 1$ if feature j is signal, and $a_j = 0$ otherwise). SF and AF are, therefore, defined as $SF = \frac{1}{Q} \sum_{i=1}^{Q} \sum_{j=1}^{p} \tilde{a}_j$ and $AF = \frac{1}{Q} \sum_{i=1}^{Q} \sum_{j=1}^{p} I\left( \tilde{a}_j = a_j \right)$, respectively, where Q is the number of simulation runs, p is the number of features, and $I(x)$ is the indicator function ($I(x) = 1$ if x is true, $I(x) = 0$ otherwise). All results are based on 100 simulated datasets. The mean values of all criteria over the 100 simulated datasets are summarized in Tables 1–3.
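The two feature-selection summaries can be computed as in the following Python sketch, which follows the formulas for SF and AF as written. The function name and the layout of the inputs (a Q × p 0/1 matrix of selections and a length-p 0/1 vector of true signal indicators) are assumptions made for this illustration.

```python
import numpy as np

def selection_summary(selected, truth):
    """Return (SF, AF) for Q simulation runs over p features."""
    selected = np.asarray(selected, dtype=bool)   # Q x p selection indicators
    truth = np.asarray(truth, dtype=bool)         # length-p signal indicators
    sf = selected.sum(axis=1).mean()              # mean number of selected features
    af = (selected == truth).sum(axis=1).mean()   # mean number of correct decisions
    return sf, af
```

With two runs `[[1, 1, 0, 0], [1, 0, 0, 1]]` and truth `[1, 1, 0, 0]`, this gives SF = 2 and AF = 3. Dividing AF by p (here 3/4) would express it as a proportion; whether the tabulated AF values are normalized in this way is not stated in the text, so that step is left to the reader.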

Table 1.

Results Based on 100 Simulated Datasets in Case I (Four Clusters and 160 Observations)

Dataset	Algorithm	Kopt	Rand	Jaccard	F-score	Purity	SF	AF (%)
Eight signal features and 42 noisy features	BiRClust	4.92	0.91	0.76	0.86	0.90	6.75	0.96
	PRclust	12.03	0.93	0.75	0.79	0.97	—	—
	k-means	2.18	0.65	0.40	0.62	0.49	—	—
	Sparcl	4.15	0.91	0.74	0.78	0.88	48.62	0.17
Fifteen signal features and 985 noisy features	BiRClust	4.71	0.93	0.85	0.88	0.95	11.32	0.96
	PRclust	154.29	0.73	0.00	0.05	0.96	—	—
	k-means	1.96	0.62	0.38	0.58	0.47	—	—
	Sparcl	3.61	0.89	0.80	0.84	0.85	929.04	0.05

Table 2.

Results Based on 100 Simulated Datasets in Case II (Four Clusters, 160 Observations, and 20 Outliers)

Dataset	Algorithm	Kopt	Rand	Jaccard	F-score	Purity	SF	AF (%)
Eight signal features and 42 noisy features	BiRClust	5.33	0.92	0.73	0.82	0.89	6.54	0.96
	PRclust	13.31	0.93	0.76	0.78	0.97	—	—
	k-means	2.28	0.59	0.33	0.55	0.44	—	—
	Sparcl	4.26	0.87	0.64	0.76	0.79	49.07	0.16
Fifteen signal features and 985 noisy features	BiRClust	5.52	0.93	0.83	0.85	0.94	10.87	0.95
	PRclust	172.17	0.76	0.00	0.05	0.96	—	—
	k-means	2.04	0.56	0.31	0.56	0.42	—	—
	Sparcl	3.91	0.88	0.76	0.84	0.82	913.70	0.06

Table 3.

Results Based on 100 Simulated Datasets in Case III (Four Clusters and 160 Observations)

Dataset	Algorithm	Kopt	Rand	Jaccard	F-score	Purity	SF	AF (%)
Eight signal features and 42 noisy features	BiRClust	5.76	0.89	0.68	0.73	0.89	6.86	0.94
	PRclust	29.81	0.87	0.59	0.32	0.93	—	—
	k-means	2.29	0.59	0.33	0.60	0.45	—	—
	Sparcl	3.95	0.83	0.59	0.69	0.73	47.62	0.15
Fifteen signal features and 985 noisy features	BiRClust	5.17	0.79	0.66	0.64	0.81	10.00	0.83
	PRclust	150.00	0.66	0.00	0.04	0.83	—	—
	k-means	1.67	0.46	0.26	0.65	0.37	—	—
	Sparcl	3.50	0.75	0.68	0.70	0.72	833.33	0.01

The simulation results first reveal that the proposed approach performs as well as or better than the other approaches in clustering, and it attains the best value on several clustering validation indices. Second, the proposed algorithm consistently selects more clusters than k-means and Sparcl, but its estimate remains close to the true number of clusters. In addition, the proposed method performs well in feature selection: it selects almost all signal features and sweeps the noisy features away. However, PRclust and k-means are unable to recognize which features are noisy, despite their ability to cluster highly noisy data. Even the sparse k-means cannot select features correctly.

4.2. Real datasets

Similar to Sun et al. (2012), we apply the proposed approach to analyze two benchmark microarray datasets: leukemia and lymphoma. The leukemia dataset (Golub et al., 1999) contains data on two types of human acute leukemias: acute myeloid leukemia and acute lymphoblastic leukemia (Herland et al., 2014). Both datasets are provided by Dettling (2004) and available at http://stat.ethz.ch/~dettling/bagboost.html. These two datasets are summarized in Table 4.

Table 4.

Summary of Two Benchmark Datasets

Data	Clusters	No. of observations	No. of genes
Leukemia	Acute myeloid leukemia	25	6817
	Acute lymphoblastic leukemia	47
Lymphoma	Large B cell lymphoma	42	4026
	Follicular lymphoma (Anzanello and Fogliatto)	9
	B cell chronic lymphocytic leukemia	11

The data are preprocessed according to a previous study (Herland et al., 2014). After preprocessing, the leukemia and lymphoma datasets retain 3571 and 4026 genes, respectively. As in the simulation studies, each clustering algorithm is run from 100 random starts to reduce its dependence on the initialization. All clustering results are summarized in Table 5.
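The idea of running an algorithm from many random starts can be sketched as follows. The example uses plain k-means (Lloyd's algorithm) as a stand-in, not BiRClust, and it keeps the run with the lowest within-cluster sum of squares; that selection rule, the function names, and the toy points are assumptions made for illustration, since the text does not state how the 100 runs are combined.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_once(data, k, rng, n_iter=50):
    """One run of Lloyd's algorithm from k randomly chosen observations."""
    centroids = [list(x) for x in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every observation.
        labels = [min(range(k), key=lambda c: sq_dist(x, centroids[c])) for x in data]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))
    return wcss, labels


def kmeans_restarts(data, k, n_starts=100, seed=0):
    """Repeat k-means from random starts and keep the lowest-WCSS run."""
    rng = random.Random(seed)
    return min((kmeans_once(data, k, rng) for _ in range(n_starts)), key=lambda r: r[0])


# Two well-separated toy groups (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_wcss, best_labels = kmeans_restarts(points, k=2)
print(best_wcss, best_labels)
```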

Table 5.

The Selected Numbers of Clusters, Informative Genes, and Clustering Errors of Clustering Two Benchmark Datasets

Data	Methods	No. of clusters	Clustering error	No. of genes
Leukemia	k-means	Fixed = 2	2/72	3571
	Sparcl	Fixed = 2	2/72	43
	BiRClust	2	2/72	2468
Lymphoma	k-means	Fixed = 3	4/62	4026
	Sparcl	Fixed = 3	2/62	36
	BiRClust	4	3/62	488

During the cluster analysis, we found that the proposed algorithm always selects more genes because of the strong correlation among genes in the data. The fused lasso penalty can be rewritten as follows:

$$\sum_{m < l} \| U_{m\cdot} - U_{l\cdot} \|_1 = \sum_{l=1}^{n} \sum_{m=l+1}^{n} \sum_{j=1}^{p} \left| u_{lj} - u_{mj} \right| = \sum_{j=1}^{p} \sum_{l=1}^{n} \sum_{m=l+1}^{n} \left| u_{lj} - u_{mj} \right|.$$

This indicates that the fused lasso not only pulls pairs of centroids toward each other but can also be viewed, feature by feature, as pulling the corresponding feature values of the centroids toward each other. Hence, the proposed algorithm tends to select more genes that are relatively strongly correlated. To address this shortcoming, we combine the proposed algorithm with sparse k-means. The improved clustering algorithm, called the mixed-clustering algorithm (mixed-clust), is given in Algorithm 2.
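The rearrangement of the fused lasso penalty can be checked numerically. The sketch below builds a small random centroid matrix (the sizes, the seed, and the variable names are arbitrary choices for illustration) and compares the pairwise L1 distances between rows with the per-feature sums of pairwise absolute differences.

```python
import random
from itertools import combinations

rng = random.Random(1)
n, p = 5, 3  # illustrative sizes: n centroids, p features
U = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]

# Row-wise form: L1 distance between every pair of centroid rows.
row_wise = sum(
    sum(abs(a - b) for a, b in zip(U[l], U[m]))
    for l, m in combinations(range(n), 2)
)

# Feature-wise form: for each feature j, sum the pairwise absolute
# differences of that feature across all pairs of centroids.
col_wise = sum(
    sum(abs(U[l][j] - U[m][j]) for l, m in combinations(range(n), 2))
    for j in range(p)
)

print(row_wise, col_wise)
```

The two totals agree up to floating-point rounding, because both are sums of the same terms taken in a different order.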

Algorithm 2:

Mixed-clustering algorithm

Input: the gene expression dataset.

Output: cluster and gene selection.

Process:

First step: Obtain the optimal tuning parameters by cross-validation.

Second step: Fix the optimal tuning parameters and obtain the optimal number of clusters by calling Algorithm 1.

Third step: Fix the number of clusters and select genes by calling the sparse k-means procedure.
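The control flow of the three steps can be sketched as a function that chains three caller-supplied procedures. This shows only the order of the steps: the names `mixed_clust`, `tune`, `estimate_k`, and `select_genes` are hypothetical, and the toy stand-ins below are placeholders that implement neither BiRClust nor sparse k-means. Taking the final labels from the third step is likewise an assumption.

```python
def mixed_clust(data, tune, estimate_k, select_genes):
    """Chain the three steps; each step is supplied by the caller."""
    params = tune(data)                    # step 1: tuning parameters
    k = estimate_k(data, params)           # step 2: number of clusters, parameters fixed
    genes, labels = select_genes(data, k)  # step 3: gene selection, k fixed
    return {"params": params, "k": k, "genes": genes, "labels": labels}


# Hypothetical stand-ins so the sketch runs; none of them is the paper's method.
def toy_tune(data):
    # Placeholder value; no cross-validation is performed here.
    return {"lambda": 1.0}


def toy_estimate_k(data, params):
    # Placeholder: count distinct rounded values of the first feature.
    return len({round(row[0]) for row in data})


def toy_select_genes(data, k):
    # Placeholder (k is unused): keep features whose range exceeds 1,
    # and label observations by the rounded value of the first feature.
    p = len(data[0])
    genes = [j for j in range(p)
             if max(r[j] for r in data) - min(r[j] for r in data) > 1]
    levels = sorted({round(row[0]) for row in data})
    labels = [levels.index(round(row[0])) for row in data]
    return genes, labels


data = [(0.1, 5.0), (0.2, 5.1), (3.9, 5.0), (4.1, 4.9)]
result = mixed_clust(data, toy_tune, toy_estimate_k, toy_select_genes)
print(result)
```

Passing the steps in as functions keeps the sketch independent of any particular implementation of the clustering and gene-selection procedures.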

The results show that the proposed approach almost always finds the true number of clusters, and its clustering error is very similar to that of the other methods. For the leukemia dataset, the proposed approach misclassifies two people, as do the other methods. For the lymphoma dataset, the proposed approach divides the observations into four groups, one more than the true number. In this case, two people are allocated to a new cluster, and another two people are assigned to an incorrect group. The clustering results of mixed-clust are a combination of the results of sparse k-means and BiRClust, so they are not presented in Table 5. We also present the frequency of selected genes across 100 clustering analyses in Table 6.

Table 6.

The Frequency of Selected Genes Based on 100 Clustering Analyses

Frequency range	Leukemia	Lymphoma
0	3483 (97.5%)	3916 (97.3%)
(0, 50)	27 (0.8%)	69 (1.7%)
[50, 100]	61 (1.7%)	42 (1%)

Table 6 indicates that only a few genes are selected across the 100 clustering analyses. This suggests that the mixed-clustering algorithm selects genes more efficiently than BiRClust, and that BiRClust can be improved by incorporating the idea of sparse k-means.
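A tabulation of the kind shown in Table 6 can be produced as sketched below: count how many runs select each gene, then tally the genes into the ranges 0, (0, cut), and [cut, maximum]. The function names and the toy runs are illustrative assumptions and do not reproduce the actual analyses.

```python
from collections import Counter


def selection_frequency(selected_per_run, n_genes):
    """For each gene index, count the number of runs that selected it."""
    counts = Counter(g for run in selected_per_run for g in set(run))
    return [counts.get(g, 0) for g in range(n_genes)]


def bin_frequencies(freqs, cut=50):
    """Return the number of genes selected never, fewer than cut times, and at least cut times."""
    never = sum(f == 0 for f in freqs)
    often = sum(f >= cut for f in freqs)
    return never, len(freqs) - never - often, often


# Toy example: 4 runs over 5 genes, with the cut set to 2 runs.
runs = [[0, 1], [0, 2], [0], [0, 1]]
freqs = selection_frequency(runs, n_genes=5)
print(freqs, bin_frequencies(freqs, cut=2))
```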

5. Discussion

This article proposed a new clustering algorithm for gene expression data that requires no prior information on the number of clusters. The key idea is to penalize not only the differences between pairs of centroids but also the centroids themselves at the feature level. The results of the numeric studies indicate that the algorithm performs well in clustering and in determining the number of clusters for a high-dimensional dataset.

However, there is still room for improvement. First, our study does not account for the dependence among genes (Pan et al., 2013). This structural information may be very important for obtaining more complete and accurate results. Second, different applications require different measures of the similarity between two observations, and the data may not always come from a multivariate normal distribution. Third, the properties of the proposed method and the conditions under which it applies remain to be studied. Finally, the algorithm can be extended to other loss functions, such as those used in model-based clustering.

Acknowledgments

The author is grateful to Prof. Haiyan Huang of the Department of Statistics, University of California, Berkeley, CA, for her helpful discussion and comments; and to Katherine Musliner, PhD, at the National Center for Register-based Research, Aarhus University, for her kind help in revision. This work is supported by “The Young Teachers Development Foundation of Central University of Finance and Economics (QJJ1510),” “The Discipline Construction Foundation of Central University of Finance and Economics,” and “The Fundamental Research Funds for the Central Universities.”

Author Disclosure Statement

No competing financial interests exist.

References

Anzanello, M.J., and Fogliatto, F.S. 2014. A review of recent variable selection methods in industrial and chemometrics applications. Eur. J. Ind. Eng., 8, 619–645.

Assent. 2012. Clustering high dimensional data. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, CHDD 2012, Naples, Italy, volume 2, 340–350.

Bertsekas, D.P. 1997. Nonlinear programming. J. Oper. Res. Soc., 48, 334–334.

Boyd, Vandenberghe, and Faybusovich. 2006. Convex optimization. IEEE Trans. Automatic Control, 51, 1859–1859.

Dettling. 2004. BagBoosting for tumor classification with gene expression data. Bioinformatics, 20, 3583–3593.

Donoho, D.L., and Johnstone, I.M. 1994. Threshold selection for wavelet shrinkage of noisy data. In Engineering in Medicine and Biology Society, 1994. Engineering Advances: New Opportunities for Biomedical Engineers. Proceedings of the 16th Annual International Conference of the IEEE, Baltimore, MD, volume 1, IEEE, A24–A25.

Fang and Wang. 2012. Selection of the number of clusters via the bootstrap method. Comput. Stat. Data Anal., 56, 468–477.

Friedman, J.H., and Meulman, J.J. 2004. Clustering objects on subsets of attributes (with discussion). J. R. Stat. Soc. Series B Stat. Methodol., 66, 815–849.

Golub, T.R., Slonim, D.K., Tamayo, et al. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Herland, Khoshgoftaar, T.M., and Wald. 2014. A review of data mining using big data in health informatics. J. Big Data, 1, 1.

Jiang, Tang, and Zhang. 2004. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowledge Data Eng., 16, 1370–1386.

Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, Montreal, Canada, 1137–1145.

, and Pati. 2016. Variable selection using shrinkage priors. Comput. Stat. Data Anal., 107, 107–119.

Liao, and Ng, M.K. 2011. Blind deconvolution using generalized cross-validation approach to regularization parameter estimation. IEEE Trans. Image Process, 20, 670–680.

Liu, Li, Xiong, et al. 2010. Understanding of internal clustering validation measures. International Conference on Data Mining, Sydney, Australia. Pages 911–916.

Luxburg, U.V. 2010. Clustering stability: An overview. Found. Trends Machine Learn., 2, 2010.

, and Huang. 2008. Penalized feature selection and classification in bioinformatics. Brief Bioinform., 9, 392–403.

Mallipeddi, Suganthan, P.N., Pan, Q.K., et al. 2011. Differential evolution algorithm with ensemble of parameters and mutation strategies. Appl. Soft Comput., 11, 1679–1696.

Pan, Shen, and Liu. 2013. Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. J. Machine Learn. Res., 14, 1865–1889.

Schmidt. 2010. Graphical model structure learning with l1-regularization. Thesis: The University of British Columbia, 2010. Pages 1–175.

Sun, Wang, and Fang. 2012. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electron. J. Stat., 6, 148–167.

Teng, S.L., and Huang. 2009. A statistical framework to infer functional gene relationships from biologically interrelated microarray experiments. J. Am. Stat. Assoc., 104, 465–473.

Thalamuthu, Mukhopadhyay, Zheng, et al. 2006. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22, 2405–2412.

Tibshirani. 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Methodol., 58, 267–288.

Tibshirani. 1997. The lasso method for variable selection in the Cox model. Stat. Med., 16, 385–395.

Tibshirani, Saunders, Rosset, et al. 2005. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Series B Stat. Methodol., 67, 91–108.

Tibshirani, Walther, and Hastie. 2001. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Series B Stat. Methodol., 63, 411–423.

Witten, D.M., and Tibshirani. 2010. A framework for feature selection in clustering. J. Am. Stat. Assoc., 105, 713–726.

Yuan and Lin. 2006. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B Stat. Methodol., 68, 49–67.

Zou. 2006. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc., 101, 1418–1429.