Robust multi-class feature selection via l 2,0 -norm regularization minimization

Abstract

Feature selection is an important data preprocessing in data mining and machine learning, that can reduce the number of features without deteriorating model’s performance. Recently, sparse regression has received considerable attention in feature selection task due to its good performance. However, because the $l_{2,0}$ -norm regularization term is non-convex, this problem is hard to solve, and most of the existing methods relaxed it by $l_{2,1}$ -norm. Unlike the existing methods, this paper proposes a novel method to solve the $l_{2,0}$ -norm regularized least squares problem directly based on iterative hard thresholding, which can produce exact row-sparsity solution for weights matrix, and features can be selected more precisely. Furthermore, two homotopy strategies are derived to reduce the computational time of the optimization method, which are more practical for real-world applications. The proposed method is verified on eight biological datasets, experimental results show that our method can achieve higher classification accuracy with fewer number of selected features than the approximate convex counterparts and other state-of-the-art feature selection methods.

Keywords

Feature selection l2 0-norm regularization iterative hard thresholding embedded method

1. Introduction

Feature selection, a process of selecting a subset of the raw features which are the most relevant and informative, has been widely investigated for many years [1, 2, 3, 4]. Due to the ability of reducing feature dimension, alleviating the effect of the curse of dimensionality, enhancing data understanding, speeding up the learning process and improving model’s performance, feature selection has become an essential component in data mining and machine learning, and has been widely applied in many practices, e.g., pattern recognition [5], text mining [6, 7], and bioinformatics [8, 9].

In General, feature selection methods can be categorized into three kinds depending on how they combine the feature selection search with model learning algorithms: filter methods, wrapper methods, and embedded methods. The filter methods use the intrinsic properties of data to select features before model learning, thus are independent of the learning algorithms. Typical filter methods include Relief [10], Chi-square and information gain [11], mRMR [12], etc. The wrapping methods treat the model learning algorithms as a black box to evaluate the selected features, such as support vector machine recursive feature elimination (SVM-RFE) [13] and correlation-based feature selection (CFS) [14]. Embedded methods incorporate the feature selection and model learning into a single optimization problem, such that higher computational efficiency and classification performance can be gained than filter methods and wrapped methods. Therefore, the embedded methods have gained large attention in recent years. Considering this, we study embedded methods in this paper.

Recently, with the development of sparsity researches, sparse regularization has been widely used in embedded feature selection methods. For binary classification task, the feature selection problem can be tackled by $l_{0}$ -norm minimization [15] directly, in which features corresponding to non-zero weights are selected. However, the non-convex and non-smooth of $l_{0}$ -norm make it very hard to solve. Most methods relax the $l_{0}$ -norm by $l_{1}$ -norm to make the minimization problem be convex and easy to solve, these methods were called LASSO [16]. Although some strategies such as one-to-all or one-to-one can be used to expand LASSO for multi-class feature selection task, structural sparsity models are more desirable so that the shared pattern of sparsity can be obtained. Inspired by that, many multi-class feature selection methods have been proposed based on structural sparsity model [17]. Nie et al. [18] firstly proposed a robust feature selection (RFS) method which combines a $l_{2,1}$ -norm loss function with a $l_{2,1}$ -norm regularization. The joint $l_{2,1}$ -norm minimization model has shown its superiority on feature selection, and is more robust to outliers. After that, the $l_{2,1}$ -norm regularization has been widely used in multi-class feature selection methods, e.g., UDFS [19], L-FS [20], RLSR [21], URAFS [22], etc.

Although satisfactory results can be achieved by using $l_{2,1}$ -norm regularization for multi-class feature selection, there are some limitations. First, $l_{2,1}$ -norm is only an approximation for $l_{2,0}$ -norm, thus the solution is different from the original optimal solution essentially. Second, Qian et al. [23] proved that $l_{2,1}$ -norm over-penalizes features with large weights, which will lead to an unfair competition between different features and hurt data approximation performance. Last, it is hard to tune the regularization factor of $l_{2,1}$ -norm to obtain exact row-sparsity solution, even a large regularization factor (e.g., $10^{5}$ ) cannot produce strong row-sparsity. Therefore, it is not clear that how many features need to be selected after the regularization factor tuned. Consequently, it is significant to solve the original $l_{2,0}$ -norm regularization problem.

In [24], Cai et al. proposed a robust and pragmatic multi-class feature selection (RPMFS) method based on $l_{2,0}$ -norm equality constrained optimization problem. RPMFS combines the $l_{2,1}$ -norm loss function with a $l_{2,0}$ -norm equality constraint, and uses the augmented Lagrangian method (ALM) to solve this equality constrained problem. In [25], Pang et al. also proposed an efficient sparse feature selection method (ESFS) based on $l_{2,0}$ -norm equality constrained problem. The equality constrained model is transformed into the same structure as linear discriminate analysis (LDA) so that the ratio of inter-class distance to intra-class distance of each feature can be calculated. However, the $l_{2,0}$ -norm equality constrained based methods exist two defects. First, the number of selected features needs to be pre-defined to construct the equality constraint. When tuning the number of selected features, the optimization program must be run repeatedly, which will cost a large amount of time and effort and is unpractical. Second, the overall performances of these methods are very sensitive to the number of selected features, in order to obtain satisfactory classification results, the number of selected features need to be tuned carefully. Literatures on sparsity research show that the regularized problem is more effective than the equality constraint problem to find a sparse solution.

Therefore, this paper proposes a novel and simple framework for multi-class feature selection that solves the original $l_{2,0}$ -norm regularization least squares problem (denoted as $l_{2,0}$ -FS) directly, in which a homotopy iterative hard thresholding method (HIHT) is introduced to perform the optimization. What’more, an accelerated version of HIHT (AHIHT) is derived to reduce the computational time. After learning, an exact row-sparsity solution is produced and the features can be selected in group. The contribution of this proposed $l_{2,0}$ -FS is that an efficient and effective optimization algorithm is designed to solve the $l_{2,0}$ -norm regularization least squares problem and produce row-sparsity solution for multi-class feature selection. After learning, the number of selected features can be fast tuned without rerunning the optimization program. To evaluate the performance of this proposed method, we use the selected features for classification, and compare the results with Baseline (without feature selection) and six supervised feature selection methods in terms of accuracy (Acc) and the number of selected features (No.fea) over eight biological data sets. The experimental results show that the performance of our approach is much better than the compared methods.

The remainder of this paper is organized as follows: Section 2 gives the notations and definitions used in this paper, and related works on $l_{2,1}$ -norm regularized problem and $l_{2,0}$ -norm equality constrained problem based multi-class feature selection methods are introduced. In Section 3, the proposed method $l_{2,0}$ -FS is presented and the optimization algorithms are introduced. Section 4 shows the experimental results, and conclusions and future work are given in Section 5.

2. Related work

2.1 Notations and definitions

This subsection summarizes the notations and definitions used in this paper. Matrices are presented as boldface uppercase letters, and vectors are presented as boldface lowercase letters. For a matrix $\bm{X}=\{x_{ij}\}\in R^{n\times m}$ , $\bm{x}^{i}$ and $\bm{x}_{j}$ denote its $i^{\text{th}}$ row and $j^{\text{th}}$ column, respectively. For a vector $\bm{x}\in R^{n}$ , $x_{i}$ denotes its $i^{\text{th}}$ element.

The $p$ -norm ( $p\neq 0$ ) of the vector $\bm{x}$ is defined as

$\displaystyle||\bm{x}||_{p}=\left(\sum_{i=1}^{n}{|x_{i}|^{p}}\right)^{1/p}.$

The $l_{0}$ -norm of the vector $\bm{x}$ is defined as

$\displaystyle||\bm{x}||_{0}=\sum_{i=1}^{n}{\sigma(x_{i})}\text{\,\,where\,\,}% \sigma(x_{i})=\left\{\begin{array}[]{l}\text{1, }\text{if }x_{i}\neq 0\\ \text{0, }\text{if }x_{i}=0\\ \end{array}\right.,$

which is a pseudo-norm and used to calculate the amount of non-zero elements of $\bm{x}$ .

The Frobenius norm of a given matrix $\bm{X}$ is defined as

$\displaystyle||\bm{X}||_{F}=\sqrt{\sum_{i=1}^{n}{\sum_{j=1}^{m}{x_{ij}^{2}}}}=% \sqrt{tr(\bm{X}^{T}\bm{X})},$

and the $l_{2,1}$ -norm of $\bm{X}$ is defined as

$\displaystyle||\bm{X}||_{2,1}=\sum_{i=1}^{n}{||\bm{x}^{i}||_{2}}=\sum_{i=1}^{n% }{\sqrt{\sum_{j=1}^{m}{x_{ij}^{2}}}}.$

The $l_{2,0}$ -norm of matrix $\bm{X}$ is defined as

$\displaystyle||\bm{X}||_{2,0}=\sum_{i=1}^{n}{1_{||\bm{x}^{i}||_{2}\neq 0}},$

where ${1}_{A}$ stands for the indicator function. For a scalar $x$ , if $x\neq 0$ , ${1}_{x}=1$ , otherwise ${1}_{x}=0$ . Also, the $l_{2,0}$ -norm is a pseudo-norm, which is used to calculate the number of non-zero rows in $\bm{X}$ . If a matrix has a large number of zero rows, we say it has the property of row-sparsity.

2.2 Multi-class feature selection based on $l_{2,1}$ -norm

In general, most multi-class feature selection algorithms based on $l_{2,1}$ -norm regularization can be formulated as follows:

$\displaystyle\underset{\bm{W,b}}{\min}||\bm{W}^{T}\bm{X}+\bm{b}\textbf{1}^{T}-% \bm{Y}||_{F}^{2}+\lambda||\bm{W}||_{2,1},$ (1)

where $\bm{X}=\{\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{N}\}\in R^{d\times N}$ is the training data. $\bm{Y}=\{\bm{y}_{1},\bm{y}_{2},\ldots,\bm{y}_{N}\}\in B^{C\times N}$ is the binary label matrix with $y_{ij}=1$ if $\bm{x}_{i}$ belongs to class $j$ ; otherwise $y_{ij}=0$ . $\bm{W}\in R^{d\times C}$ denotes the model weights and $\bm{b}\in R^{C\times 1}$ denotes the learned biased vector. $\textbf{1}\in R^{N\times 1}$ denotes the column vector with all elements being 1. $N$ denotes the sample size, $C$ is the number of classes, and $d$ is the feature dimension. $\lambda$ is the regularized factor. After optimizing, the features of $\bm{X}$ are selected according to the magnitude of $||\bm{w}^{i}||_{2}$ .

This problem has been widely studied and a lot of variants have been proposed. Yuan et al. [26] proposed a Group LASSO as an extension of the LASSO by using the $l_{2,1}$ -norm for group feature selection. After the work of [26], the Group LASSO has been applied to many real life applications and a wide variety of formulations and solution approaches also have been presented [27, 28, 29]. Nie et al. [18] firstly combined the $l_{2,1}$ -norm regularization with a $l_{2,1}$ -norm loss function instead of the Frobenius norm loss function, and demonstrated that the proposed model is more robust to outliers than the original model. In [30], Masaeli et al. proposed the linear discriminant feature selection (LDFS) method which enforces row sparsity on the transformation matrix of LDA through $l_{\infty,1}$ -norm regularization. After that, Zhang et al. [20] combined the LDA model with $l_{2,1}$ -norm regularization to select more discriminative features, and Tao et al. [31] combined the LDA with a $l_{2,p}$ -norm regularization to obtain a better row-sparse solution. Recently, Yan et al. [32] imposed $l_{2,1}$ -norm and nonnegative constraints on the model weights. The nonnegative constrain ensures the row-sparsity of learned model weights combining with the $l_{2,1}$ -norm minimization, which makes it clearer to select which features. $L_{2,1}$ -norm regularization also has been widely used in unsupervised feature selection task. As in [19], Yang et al. proposed an unsupervised feature selection joint model which incorporates discriminative analysis into the $l_{2,1}$ -norm minimization. More related works can be seen in [22, 33]

Although the $l_{2,1}$ -norm based model can achieve satisfactory results, it is difficult to determine how many features need to be selected after the regularization factor tuned. From the sparsity perspective, $l_{2,0}$ -norm is more desirable.

2.3 Multi-class feature selection based on

l_{2,0}

-norm constrained problem

The multi-class feature selection based on $l_{2,0}$ -norm always construct an equality constraint to determine the number of non-zero rows of weights matrix. In [24], Cai et al. construct the objective function as a $l_{2,1}$ -norm loss function with a $l_{2,0}$ -norm equality constraint, which can be formulated as

$\displaystyle\underset{\textbf{W,b}}{\min}||\textbf{W}^{T}\textbf{X}+\textbf{b% }\textbf{1}^{T}-\textbf{Y}||_{\text{2,}1}$

(2) $\displaystyle s.t.||\textbf{W}||_{\text{2,}0}=k,$

then the Augmented Lagrangian Multiplier (ALM) method is used to solve this problem.

In [25], the authors form a similar model that is written as follows:

$\displaystyle\underset{\textbf{W,b}}{\min}||\textbf{W}^{T}\textbf{X}+\textbf{b% }\textbf{1}^{T}-\textbf{YQ}||_{\text{2,}1}$ (3) $\displaystyle s.t.||\textbf{W}||_{\text{2,}0}=k,$

where $\textbf{Q}\in R^{C\times C}$ can be any reversible matrix which is used to code labels. The authors have transformed model Eq. (2.3) into the same structure as LDA by using this label coding method, which can calculate the ratio of inter-class distance to intra-class distance, so that discriminative features can be selected. Similarly, Won et al. [34] combined the $l_{2,0}$ -norm equality constraint with SVM model to select features in group for networked data.

Though the above mentioned feature selection methods are based on $l_{2,0}$ -norm, they also need to predefine the number of selected features to construct the equality constraint. How many features need be selected is unknown, so it will cost a large amount of time and effort to tune the number of selected features to get a satisfactory result. From literatures of sparsity research, it has been proven that the regularized problem is more effective than the equality constraint problem to find a sparse solution.
3. Robust feature selection based on $l_{2,0}$ -norm regularization

3.1 Problem formulation

In this work, we use the least squares regression combined with a $l_{2,0}$ -norm regularization for multi-class feature selection, the optimization function is formulated as

$\displaystyle\varphi_{\lambda}(\textbf{W})=\underset{\textbf{W}}{\min}\frac{1}% {2}||\textbf{W}^{T}\textbf{X}+\textbf{b}\textbf{1}^{T}-\textbf{Y}||_{F}^{2}+% \lambda||\textbf{W}||_{2,0},$ (4)

where the notations are defined as in 2.2.

According to the Karush-Kuhn-Tucker (KKT) theorem [35], set the derivative of Eq. (4) with respect to b to be 0, then the optimal solution of b can be obtained as follows:

$\displaystyle\mathbf{b}=\frac{1}{N}\left(\mathbf{Y}-\mathbf{W}^{T}\mathbf{X}% \right)\mathbf{1}.$ (5)

Replacing b with Eq. (5), we can rewrite Eq. (4) as

$\displaystyle\varphi_{\lambda}(\textbf{W})=\underset{\mathbf{W}}{\min}\frac{1}% {2}||(\mathbf{W}^{T}\mathbf{X}-\mathbf{Y})\mathbf{H}||_{F}^{2}+\lambda||% \mathbf{W||}_{\text{2,}0},$ (6)

where $\textbf{H}=\textbf{I}-(1/N)\mathbf{11}^{T}$ is referred as the centering matrix, which has following properties

$\displaystyle\mathbf{H}=\mathbf{H}^{T}=\mathbf{HH}^{T}=\mathbf{H}^{T}\mathbf{H}.$ (7)

Now, the objective function of $l_{2,0}$ -FS is presented totally, next we will show how to optimize it. Iterative hard thresholding (IHT) algorithm [36, 37, 38] is usually used to solve $l_{0}$ -norm regularization minimization for the vector sparsity problem. Inspired by it, we extended IHT to solve the matrix sparsity Eq. (6).

3.2 Optimization algorithm

Denoting $\mathbf{\tilde{X}}=\mathbf{XH}$ and $\mathbf{\tilde{Y}}=\textbf{YH}$ , Eq. (6) can be reformulated as

$\displaystyle\varphi_{\lambda}(\mathbf{W})=\underset{\mathbf{W}}{\min}\frac{1}% {2}||\mathbf{W}^{T}\mathbf{\tilde{X}}-\mathbf{\tilde{Y}||}_{F}^{2}+\lambda||% \mathbf{W||}_{2,0}.$ (8)

IHT algorithms use the proximal gradient technique to update the solution iteratively. Firstly, we rewrite Eq. (8) as

$\displaystyle\varphi_{\lambda}(\mathbf{W})=\underset{\mathbf{W}}{\min}f(% \textbf{W})+\lambda||\mathbf{W||}_{2,0},$ (9)

where $f(\textbf{{W}})=\frac{1}{2}||\textbf{W}^{T}\mathbf{\tilde{X}}-\mathbf{\tilde{Y% }}||_{F}^{2}$ is a differentiable convex function, and its gradient $\nabla f(\textbf{W})$ satisfies the Lipschitz continuous condition (denote its Lipschitz constant as $L_{f}$ ). Therefore, $f(\textbf{{W}})$ can be approximately iteratively updated by the projected gradient method

$\displaystyle\textbf{W}^{t+1}=\arg\mathop{\min}\limits_{\textbf{W}}f(\textbf{W% }^{t})+tr(\nabla f(\textbf{W}^{t})^{T}\left(\textbf{W}-\textbf{W}^{t}\right))+% \frac{{L}}{2}||\textbf{W}-\textbf{W}^{t}||_{F}^{2},$ (10)

where $L\geqslant 0$ is a constant, which should satisfy the condition of $L\geqslant L_{f}$ .

Adding $\lambda||\textbf{W}||_{2,0}$ into both sides of Eq. (10), Eq. (8) can be solved by iteratively optimizing the subproblem

[b] Homotopy iterative hard thresholding method to solve Eq. (4)(Input:) Training data $\textbf{X}\in R^{d\times N}$ , training labels $\textbf{Y}\in R^{C\times N}$ , centering matrix $\textbf{H}\in R^{N\times N}$ , parameters $L_{0},{\lambda}_{0},\lambda_{tgt}$ ; (Output:) $\textbf{W}^{*}$ ; 1: initialize $\textbf{W}^{0}=\textbf{0},\rho\in(0,1),\epsilon>0,\eta>0,\gamma>1,k\leftarrow 0$ ; 2: do3: $t\leftarrow 0$ ;4: $\textbf{W}^{k,0}=\textbf{W}^{k}$ ;5: $L_{k,0}\leftarrow L_{k}$ ;6: do An L-tuning iteration indexed by $t$ 7: update $\textbf{W}^{k,t+1}$ by Eq. (14);8: while ${\varphi}_{\lambda_{k}}(\textbf{W}^{k,t})-{\varphi}_{\lambda_{k}}(\textbf{W}^{% k,t+1})<\frac{\eta}{2}||\textbf{W}^{k,t}-\textbf{W}^{k,t+1}||_{F}^{2}$ do9: $L_{k,t}\leftarrow\gamma L_{k,t}$ ;10: update $\textbf{W}^{k,t+1}$ by Eq. (14);11: end while12: $L_{k,t+1}\leftarrow L_{k,t}$ ;13: $t\leftarrow t+1$ ;14: until $||\textbf{W}^{k,t}-\textbf{W}^{k,t+1}||^{2}_{F}\leqslant\epsilon$ 15: $\textbf{W}^{k+1}\leftarrow\textbf{W}^{k,t}$ ;16: $L_{k+1}\leftarrow L_{k,t}$ .17: ${\lambda}_{k+1}\leftarrow\rho{\lambda}_{k}$ ;18: $k\leftarrow k+1$ ;19: until ${\lambda}_{k+1}\leqslant\lambda_{\textit{tgt}}$ ;20: $\textbf{W}^{\star}\leftarrow\textbf{W}^{k}$ .

By removing the item $f(\textbf{W}^{t})$ and adding an item $\frac{1}{2L}||\nabla f(\textbf{W}^{t})||_{F}^{2}$ , both of which are independent on W) and thus can be seen as constant items, the right hand of Eq. (11) can be rewritten as

$\displaystyle\textbf{W}^{t+1}=\arg\mathop{\min}\limits_{\textbf{W}}\frac{L}{2}% \left\{||\textbf{W}-(\textbf{W}^{t}-\frac{1}{{L}}\nabla f(\textbf{W}^{t}))||_{% F}^{2}+\frac{{2\lambda}}{{L}}||\textbf{W}||_{2,0}\right\},$ (12)

Since the Frobenius norm and $l_{2,0}$ -norm are all separable function (both of these two functions can be represented as a sum of single-variable functions), each row of W can be updated individually

$\displaystyle(\textbf{w}^{i})^{t+1}=\arg\mathop{\min}\limits_{\textbf{w}^{i}}% \frac{L}{2}\left\{||\textbf{w}^{i}-((\textbf{w}^{i})^{t}-\frac{1}{L}\nabla f((% \textbf{w}^{i})^{t}))||_{2}^{2}+\frac{2\lambda}{L}{1}_{||\textbf{w}^{i}||_{2}% \neq 0}\right\},$ (13)

this subproblem has a closed-form solution as

$\displaystyle(\textbf{w}^{i})^{t+1}=\left\{{\begin{array}[]{l}{(\textbf{w}^{i}% )^{t}-\frac{1}{{L}}\nabla f((\textbf{w}^{i})^{t}),if\;||(\textbf{w}^{i})^{t}-% \frac{\nabla f((\textbf{w}^{i})^{t})}{{L}}||_{2}^{2}>\frac{{2\lambda}}{{L}}}\\ {\textbf{0},\text{ otherwise}}\\ \end{array}}\right.$ (14)

in which the parameter $L$ needs to be tuned. The upper bound on Lipschitz constant $L_{f}$ is not be easily calculated, thus we use the line search method to tune $L$ as suggested in [37] until the objective value descent.

Homotopy Strategy: many works [39, 40] have verified that the sparse coding approaches benefit from a good starting point. Thus we can use the solution of Eq. (8), for a given value of $\lambda$ , to initialize IHT in a nearby value of $\lambda$ . Usually, the next loop with warm starting will require fewer iterations than the current loop. Using this warm-starting technique, we can efficiently solve Eq. (8) for a sequence of values of $\lambda$ , which is called homotopy strategy. The complete routine of the HIHT algorithm for solving Eq. (8) is described in Algorithm 3.2.

3.3 Convergence analysis

For a fixed $\lambda_{t}$ , since $\bigtriangledown f$ is Lipschitz continuous with constant $L_{f}$ , we have

$\displaystyle f(\textbf{W}^{k+1})\leqslant f(\textbf{W}^{k})+tr(% \bigtriangledown f(\textbf{W}^{k})^{T}(\textbf{W}^{k+1}-\textbf{W}^{k}))+\frac% {L_{f}}{2}||\textbf{W}^{k+1}-\textbf{W}^{k}||^{2}.$ (15)

Using this inequality, the fact that $L>L_{f}$ , and Eq. (11), we obtain that

$\displaystyle{\varphi}_{\lambda_{t}}(\textbf{W}^{k+1})=f(\textbf{W}^{k+1})+% \lambda_{t}||\textbf{W}^{k+1}||_{0}$ $\displaystyle\quad\leqslant f(\textbf{W}^{k})+tr(\bigtriangledown f(\textbf{W}% ^{k})^{T}(\textbf{W}^{k+1}-\textbf{W}^{k}))+\frac{L_{f}}{2}||\textbf{W}^{k+1}-% \textbf{W}^{k}||^{2}+\lambda_{t}||\textbf{W}^{k+1}||_{0}$ (16) $\displaystyle\quad\leqslant f(\textbf{W}^{k})+tr(\bigtriangledown f(\textbf{W}% ^{k})^{T}(\textbf{W}^{k+1}-\textbf{W}^{k}))+\frac{L}{2}||\textbf{W}^{k+1}-% \textbf{W}^{k}||^{2}+\lambda_{t}||\textbf{W}^{k+1}||_{0}$ $\displaystyle\quad\leqslant f(\textbf{W}^{k})+\lambda_{t}||\textbf{W}^{k}||_{0% }={\varphi}_{\lambda_{t}}(\textbf{W}^{k}),$

where the last inequality follows from Eq. (11).1

$\textbf{W}^{k+1}$ is the optimal solution of Eq. (11) at $k+1$ iteration, thus the objective value of Eq. (11) with respect to $\textbf{W}^{k+1}$ is less than that with respect to $\textbf{W}^{k}$ . Replacing W by $\textbf{W}^{k}$ in the right side of Eq. (11), then the last inequality holds.

The above inequality implies that for a fixed

\lambda_{t}

{\varphi}_{\lambda_{t}}\{\textbf{W}^{k}\}

is non-increasing and moreover,

$\displaystyle{\varphi}_{\lambda_{t}}(\textbf{W}^{k})-{\varphi}_{\lambda_{t}}(% \textbf{W}^{k+1})\geqslant\frac{L-L_{f}}{2}||\textbf{W}^{k+1}-\textbf{W}^{k}||% ^{2}.$ (17)

Since $f(\textbf{W})$ is bounded below, it then follows that ${\varphi}_{\lambda_{t}}\{\textbf{W}^{k}\}$ is bounded below. Hence, ${\varphi}_{\lambda_{t}}\{\textbf{W}^{k}\}$ converges to a finite value as $k\rightarrow\infty$ and a local optimal solution $\textbf{W}_{\lambda_{t}}^{*}$ can be achieved.

Since the $\lambda$ is monotonically decreased, and $\textbf{W}_{\lambda_{t}}^{*}$ is set as the initial solution for HIHT in $\lambda_{t+1}$ , we obtain that:

$\displaystyle{\varphi}_{\lambda_{t}}(\textbf{W}_{\lambda_{t}}^{*})>{\varphi}_{% \lambda_{t+1}}(\textbf{W}_{\lambda_{t}}^{*})={\varphi}_{\lambda_{t+1}}(\textbf% {W}_{\lambda_{t+1}}^{0})\geqslant{\varphi}_{\lambda_{t+1}}(\textbf{W}_{\lambda% _{t+1}}^{*}),$ (18)

it implies that the objective value is monotonically decreasing and a local optimal solution can be achieved by the proposed algorithm. We show an example of the objective value at each iteration in Fig. 1, it can be seen that the objective function values monotonically decrease at each iteration until convergence, which verifies the convergence of Algorithm 3.2 experimentally.

Figure 1.

An example of objective value at each iteration.

3.4 Acceleration of HIHT

Inspired by [39], we derived an accelerated version for HIHT to reduce the computational time. For each fixed value of $\lambda$ , Algorithm 3.2 repeats steps 6–14 to get a exact solution of Eq. (8). In the following, for acceleration of Algorithm 3.2, just one outer loop is called to replace steps 6–14. The complete routine of the the AHIHT method is described in Algorithm 3.4.

[t] Acceleration of HIHT(Input:) Training data $\textbf{X}\in R^{d\times N}$ , training labels $\textbf{Y}\in R^{C\times N}$ , centering matrix $\textbf{H}\in R^{N\times N}$ , parameters $L_{0},{\lambda}_{0},\lambda_{\textit{tgt}}$ ; (Output:) $\textbf{W}^{*}$ ; 1: initialize $\textbf{W}^{0}=\textbf{0},\rho\in(0,1),\epsilon>0,\eta>0,\gamma>1,k\leftarrow 0$ ; 2: do An L-tuning iteration3: update $\textbf{W}^{k+1}$ by Eq. (14);4: while ${\varphi}_{\lambda_{k}}(\textbf{W}^{k})-{\varphi}_{\lambda_{k}}(\textbf{W}^{k+% 1})<\frac{\eta}{2}||\textbf{W}^{k}-\textbf{W}^{k+1}||_{F}^{2}$ do5: $L_{k}\leftarrow\gamma L_{k}$ ;6: update $\textbf{W}^{k+1}$ by Eq. (14);7: end while8: $L_{k+1}\leftarrow L_{k}$ ;9: ${\lambda}_{k+1}\leftarrow\rho{\lambda}_{k}$ ;10: $k\leftarrow k+1$ ;11: until ${\lambda}_{k+1}\leqslant\lambda_{\textit{tgt}}$ 12: $\textbf{W}^{\star}\leftarrow\textbf{W}^{k}$ .

4. Experiment

The proposed $l_{2,0}$ -FS method is evaluated by several experiments. The experiments are divided into three parts: (1) We evaluate the proposed method in terms of No.fea and Acc, and compare the results with Baseline and other six state-of-the-art multi-class feature selection methods. (2) We evaluate the performance sensitivity to the regularization factor $\lambda$ . The influence of initialization of AHIHT for feature selection is also evaluated. (3) We compare the convergence speed and computational time of some sparsity-based methods.

4.1 Data sets description

Eight biological benchmark data sets are used to validate the performance of our approach in the experiments: Brain [41], Breast3 [42], Leukemia [43], Lung [44], Lymphoma [45], NCI [46], Prostate [47], and Srbct [48]. The 8 benchmark data sets were downloaded from Feiping Nie’s homepage,2

²
http://www.escience.cn/system/file?fileId=82035.

Table 1 gives a detail introduction to these data sets, it can be seen that these datasets are all high-dimensional with small sample sizes, it is time-consuming and may result in over-fitting if classify these datasets using raw features, therefore feature selection is important for these datasets.

Table 1

Datasets desciption

Datasets	#samples	#features	#classes
Brain	42	5597	5
Breast3	95	4869	3
Leukemia	38	3051	2
Lung	203	3312	5
Lymphoma	62	4026	3
NCI	61	5244	8
Prostate	102	6033	2
Srbct	63	2308	4

Table 2

The ACC value with corresponding No.fea using selected features (red is the best result and blue is the second one)

Dataset	Baseline	Relief		mRMR		RFS		RLSR		RPMFS		ESFS		$l_{2,0}$ -FS
	Acc	No. fea	Acc	No. fea	Acc	No. fea	Acc	No. fea	Acc	No. fea	Acc	No. fea	Acc	No. fea	Acc
	KNN
Brain	71.67	400	85.00	200	81.67	[rgb]0, 0, 160	[rgb] 0, 0, 187.50	380	80.00	340	82.50	200	81.67	[rgb]1, 0, 0100	[rgb]1, 0, 088.33
Breast3	50.65	160	54.58	[rgb]0, 0, 1380	[rgb]0, 0, 158.39	340	55.48	340	58.06	340	53.55	[rgb]0, 0, 1380	[rgb] 0, 0, 158.39	[rgb]1, 0, 0140	[rgb]1, 0, 061.29
Leukemia	96.67	400	97.50	240	99.17	[rgb]0, 0, 1260	[rgb]0, 0, 1100.00	400	97.50	400	99.17	200	99.17	[rgb]1, 0, 080	[rgb]1, 0, 0100.00
Lung	94.39	[rgb]0.00,0.00,1.00380	[rgb]0.00,0.00,1.0094.39	200	94.09	[rgb]1, 0, 080	[rgb]1, 0, 095.67	400	91.67	300	92.88	400	91.82	200	93.33
Lymphoma	97.50	340	99.50	[rgb]1, 0, 060	[rgb]1, 0, 0100.00	140	100.00	260	99.00	280	99.00	20	99.00	[rgb]0, 0, 140	[rgb]0, 0, 1100.00
NCI	73.89	340	73.33	[rgb]0, 0, 1240	[rgb]0, 0, 173.89	[rgb]0, 0, 1240	[rgb]0, 0, 173.89	220	72.78	380	67.78	320	73.33	[rgb]1, 0, 060	[rgb] 1, 0, 074.44
Prostate	81.82	[rgb]0, 0, 140	[rgb]0, 0, 191.82	60	90.30	[rgb]1, 0, 020	[rgb]1, 0, 093.03	20	89.09	20	86.06	20	87.88	[rgb]1, 0, 020	[rgb]1, 0, 093.94
Srbct	93.16	240	98.95	260	98.95	[rgb]1, 0, 020	[rgb]1, 0, 0100.00	60	96.84	380	93.68	400	100.00	[rgb]0, 0, 140	[rgb]0, 0, 1100.00
	Softmax
Brain	82.50	400	88.33	400	88.33	[rgb]0, 0, 1160	[rgb]0, 0, 192.50	220	85.83	280	88.33	400	85.42	[rgb]1, 0, 0180	[rgb]1, 0, 093.33
Breast3	58.71	280	59.35	240	58.06	340	61.94	[rgb]0, 0, 1300	[rgb] 0, 0, 161.94	340	60.97	400	58.39	[rgb]1, 0, 0280	[rgb]1, 0, 062.90
Leukemia	99.17	320	98.33	60	99.17	120	99.17	[rgb]0, 0, 1220	[rgb]0, 0, 1100.00	240	100.00	400	99.17	[rgb]1, 0, 0120	[rgb]1, 0, 0100.00
Lung	[rgb]0.00,0.00,1.0095.76	280	94.70	400	95.30	[rgb]1, 0, 0180	[rgb]1, 0, 096.36	320	93.94	340	94.55	400	94.39	300	95.61
Lymphoma	94.00	220	95.00	20	95.50	[rgb]0, 0, 160	[rgb]0, 0, 198.50	60	96.00	320	97.50	[rgb]0, 0, 160	[rgb]0, 0, 198.50	[rgb]1, 0, 0280	[rgb]1, 0, 099.50
NCI	[rgb]0, 0, 176.11	400	75.00	140	72.22	140	73.33	360	71.67	400	72.22	260	73.33	[rgb]1, 0, 0220	[rgb]1, 0, 078.33
Prostate	90.61	120	92.73	260	92.12	[rgb]0, 0, 1140	[rgb] 0, 0, 195.45	400	93.94	280	91.52	260	93.33	[rgb]1, 0, 0100	[rgb]1, 0, 095.76
Srbct	99.47	240	98.42	20	97.89	[rgb]1, 0, 020	[rgb]1, 0, 0100.00	160	97.37	180	96.84	280	98.95	[rgb]0, 0, 180	[rgb]0, 0, 1100.00

Figure 2.

Acc vs. No.fea using selected features by KNN. a. Breast3. b. Lung. c. NCI. d. Srbct.

Figure 3.

Acc vs. No.fea using selected features by Softmax. a. Brain. b. Leukemia. c. Lymphoma. d. Prostate.

Figure 4.

Parameter sensitivity demonstration on different data sets. a. Brain. b. Leukemia. c. Lung. d. Prostate.

Figure 5.

Results of classification accuracy and the number of selected features with respect to different initialization. a. Breast3. b. Lung. c. Srbct.

Figure 6.

The objective function value v.s. iteration number. a. Leukemia. b. Prostate.

Figure 7.

The average computational time of each sparsity regularization method.

Figure 8.

Classification accuracy and the number of selected features with respect to $\lambda$ of AHIHT and HIHT. a. Breast3. b. Lung. c. NCI.

4.2 Experiment setup

In the experiments, we evaluate the feature selection performance on classification accuracy, where two popular classifiers KNN and Softmax are used. For each data set, $\frac{2}{3}$ of samples per class are randomly selected as training samples, and the rest samples are responsible for testing. Ten repeated trials are carried out and the average results are recorded for comparison. We compare our feature selection method with Baseline (without feature selection) and six state-of-the-art feature selection methods: (1) two basic filter methods: Relief [10], and mRMR [12]; (2) two $l_{2,1}$ -norm based methods: RFS [18], and RLSR [21]; (3) two $l_{2,0}$ -norm constrained methods: RPMFS [24], and EFSF [25]. The codes of Relief and mRMR are provided by the FEAST package [49], and others can be downloaded from the authors’ homepages.

For our method, the parameter $\lambda$ is searched in the grid of $\{10^{-5},10^{-4},\ldots,10^{0}\}$ , and for RFS and RLSR it is searched in $\{10^{-5},10^{-3},\ldots,10^{5}\}$ . For all methods, the number of selected features is tuned from $\{20,40,\ldots,400\}$ in this paper. The $k$ nearest neighbors of KNN is set as $k=5$ . All other parameters take the default values as suggested by the authors. For RLSR, all training data are used as labeled data, thus it can be seen as a supervised method here. The number of selected features with highest classification accuracy are recorded.

4.3 Classification performance

The classification performance of our approach is evaluated in this part. Table 2 shows the best Acc of each method with corresponding No.fea. From this table it can be seen that, our method can achieve highest Acc with fewest No.fea on most data sets, or comparable results to the best ones. On NCI dataset, the Acc of our approach obtained by Softmax is 78.33%, which is 3.33% higher than the second one (expect Baseline). RPMFS, ESFS and our algorithm both solve the original $l_{2,0}$ -norm problem, but RPMFS and ESFS try to solve an equality constraint problem so that the number of selected features needs to be tuned carefully to obtain a satisfactory classification result. The numerical comparison shows that our method outperforms RPMFS and ESFS most of the time, which indicates that our approach is more efficient than RPMFS and ESFS. The classification results demonstrate that our method can remove more redundant features while maintains the discriminative performance.

Figures 2 and 3 show Acc V.S. No.fea obtained by KNN and Softmax, respectively. It is obvious that the proposed method distinctly outperforms other approaches in the most experimental data sets. What’s more, compared with other lines, ours are not fluctuated too much as the No.fea changed, which indicates the performance of the proposed approach is much more robust to the number of selected features than other methods.

4.4 Parameter sensitivity

We evaluate the sensitivity of $\lambda$ in this part, the Acc is employed to evaluate the performance of classification with $\lambda$ searched in the grid of $\{10^{-5},10^{-4},10^{-3},10^{-2},10^{-1}\}$ . The number of selected features varies in $\{30,60,\ldots,150,180\}$ . The data sets of Brain, Leukemia, Lung and Prostate are used for testing, and the experimental results are shown in Fig. 4. It can be seen from this figure that, the performance of $l_{2,0}$ -FS is not very sensitive to $\lambda$ . When $\lambda$ is in the range of $\{10^{-5},\ldots,10^{-2}\}$ , there is almost no fluctuation on Acc. Thus the proposed approach is robust to $\lambda$ and there is no need to spend much time to tune the value of $\lambda$ . It also can be seen that the performance gets worst when $\lambda=10^{-1}$ , the reason is that a large value of regularization factor will hurt data approximation performance.

$L_{2,0}$ -norm is a non-convex problem, which may make the solution sensitive to the initialization. Three kinds of initialization are used to explore the effect on classification performance: zero-valued initialization, random Gaussian distributed initialization, and random uniform distributed initialization. For Gaussian and uniform distributed initialization, ten repeated trials are carried out and the average results are recorded. Figure 5 shows the results (in this figure, the No.fea is equal to the number of non-zero rows of W), in which data sets Breast3, Lung, and Srbct are used. It can be seen that these three different initialization methods can get the same results, indicating that AHIHT is not sensitive to initialization when applied to feature selection.

4.5 Comparison of convergence speed and time consumption

Figure 6 plots the objective function value for each iteration of the three $l_{2,0}$ -norm based methods. As can be observed, our method can decrease the objective value quickly in early iterations, which indicates that our method converges faster than RPMFS and ESFS. Figure 7 shows the average computational time of each sparsity regularization method on the eight data sets, which also compares AHIHT with HIHT. It can be seen that the computational time of RFS, RPMFS and AHIHT are much less than the other three methods, the reason is that the computational complexity of RFS, RPMFS and AHIHT is in propotion to $d$ while for other methods it is in propotion to $d^{2}$ or $d^{3}$ . Comparing AHIHT with HIHT, it can be found that AHIHT can reduce the computational time of HIHT effectively, thus is more suitable for practical applications.

4.6 Comparison of AHIHT and HIHT

In this part, we compare the performance of HIHT and its accelerated version in terms of accuracy and the row-sparsity of the produced solutions with different values of $\lambda$ . Figure 8 shows the results, in which data sets Breast3, Lung, and NCI are used. From this figure, it can be seen that, when $\lambda$ is less than 0.1, HIHT can get a more sparse solution than AHIHT, while AHIHT can get higher classification accuracy than HIHT, it means that the features selected by AHIHT are more useful for classification than those of HIHT. What’s more, the results obtained by AHIHT demonstrate AHIHT is more robust to $\lambda$ than HIHT. For HIHT, it should tune a small value of $\lambda$ to get satisfactory results, while a small value of $\lambda$ will spend more computational time, which has been shown in Fig. 7. In conclusion, AHIHT is more practical than HIHT for feature selection.

5. Conclusions

This paper proposes an effective algorithm to solve the original $l_{2,0}$ -norm regularization least square problem for multi-class feature selection, in which a homotopy iterative hard thresholding is proposed to optimize the proposed model, thus the exact row-sparsity solution can be obtained. What’s more, an accelerated version of HIHT is derived to reduce the computational time, that is more practical for real-word applications. Experimental results on eight biological data sets have shown that the proposed approach can achieve better or comparable classification performance compared with other six state-of-the-art feature selection algorithms.

Footnotes

Acknowledgments

This work is supported by National Natural Science Foundation of China (NSFC) under grant #61873067.

References

Dash

and Liu

, Feature selection for classification, Intelligent Data Analysis 1 (1997), 131–156.

Liu

Yang

Zhou

and Lu

, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering 26 (2014), 2138–2150.

Zhao

Cai

et al., Graph regularized feature selection with data reconstruction, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 689–700.

Qasim

O.S.

Mahmoud

M.S.

and Hasan

F.M.

, Hybrid binary dragonfly optimization algorithm with statistical dependence for feature selection, International Journal of Mathematical Engineering and Management Sciences 5 (2020), 1420–1428.

Suchetha

N.K.

Nikhil

and Hrudya

, Comparing the wrapper feature selection evaluators on twitter sentiment classification, in: 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), 2019, pp. 1–6.

Tang

Kay

and He

, Toward optimal feature selection in naive bayes for text categorization, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 1602–1606.

Wan

Wang

Liu

and Feng

, Composite feature extraction and selection for text classification, IEEE Access 7 (2019), 35208–35219.

Saeys

Inza

and Larranaga

, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007), 2507–2517.

Lei

B.Y.

Zhao

Y.J.

et al., Adaptive sparse learning using multi-template for neurodegenerative disease diagnosis, Medical Image Analysis 61 (2020), 101632.

10.

Kira

and Rendell

L.A.

, A practical approach to feature selection, in: Proceedings of the Ninth International Workshop on Machine Learning, 1992, pp. 249–256.

11.

Raileanu

L.E.

and Stoffel

, Theoretical comparison between the gini index and information gain criteria, Annals of Mathematics and Artificial Intelligence 41 (2004), 77–93.

12.

Peng

Long

and Ding

, Feature selection based on mutual information criteria of max-dependency, max-relevance, and minredundancy, IEEE transactions on Pattern Analtsis and Machine Intelligence 27 (2005), 1226–1238.

13.

Guyon

Weston

Barnhill

and Vapnik

, Gene selection for cancer classification using support vector machines, Machine Learning 46 (2002), 389–422.

14.

Hall

M.A.

and Smith

L.A.

, Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, in: Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, 1999, pp. 235–239.

15.

Weston

Elisseeff

Scholkopf

and Tipping

, Use of the zero-norm with linear models and kernel methods, Journal of Machine Learning Research 3 (2003), 1439–1461.

16.

Tibshirani

, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B: Methodological 58 (1996), 267–288.

17.

Gui

Sun

Tao

and Tan

, Feature selection based on structured sparsity: A comprehensive study, IEEE Transactions on Neural Networks and Learning Systems 28 (2017), 1490–1507.

18.

Nie

Huang

Cai

and Ding

C.H.

, Efficient and robust feature selection via joint

l_{2,1}

-norms minimization, in: Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.

19.

Yang

Shen

H.T.

Huang

and Zhou

L_{2,1}

-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.

20.

Zhang

Wang

and Zeng

Z.Q.

, L2,1-norm regularized fisher criterion for optimal feature selection, Neurocomputing 166 (2015), 455–463.

21.

Chen

Nie

Yuan

and Huang

J.Z.

, Semi-supervised feature selection via rescaled linear regression, in: Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 1525–1531.

22.

Zhang

Liu

and Nie

, Generalized uncorrelated regression with adaptive graph for unsupervised feature selection, IEEE Transactions on Neural Networks and Learning Systems 30 (2019), 1587–1595.

23.

Qian

and Zhai

, Joint adaptive loss and

l_{2}\backslash l_{0}

-norm minimization for unsupervised feature selection, in: 2015 International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.

24.

Cai

Nie

and Huang

, Exact top-k feature selection via

l_{2,0}

-norm constraint, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013, pp. 1240–1246.

25.

Pang

Nie

Han

and Li

, Efficient feature selection via l2,0-norm constrained sparse regression, IEEE Transactions on Knowledge and Data Engineering 31 (2019), 880–893.

26.

Yuan

and Lin

, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society. Series B: Statistical Methodology 68 (2006), 49–67.

27.

Jacob

Obozinski

and Vert

J.P.

, Group lasso with overlap and graph lasso, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 433–440.

28.

Yuan

Liu

and Ye

, Efficient methods for overlapping group lasso, Advances in Neural Information Processing Systems 24 (2011), 352–360.

29.

Rao

Cox

Nowak

and Rogers

T.T.

, Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis, Advances in Neural Information Processing Systems 26 (2013), 2202–2210.

30.

Masaeli

J.G.

and Fung

G.M.

, From transformation-based dimensionality reduction to feature selection, in: Proceedings of the Twenty-Seven International Conference on Machine Learning, 2010, pp. 751–758.

31.

Tao

Hou

Nie

Jiao

and Yi

, Effective discriminative feature selection with nontrivial solution, IEEE Transactions on Neural Networks and Learning Systems 27 (2016), 796–808.

32.

Yan

and Yang

, Robust joint feature weights learning framework, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 1327–1339.

33.

Nie

Zhu

and Li

, Structured graph optimization for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering 33 (2021), 1210–1222.

34.

Won

Manzour

and Chaovalitwongse

, Convex optimization for group feature selection in networked data, INFORMS Journal on Computing 31 (2020), 182–198.

35.

Karush

, Minima of functions of several variables with inequalities as side constraints, in: M.S. thesis, Dept. Math., Univ. Chicago, Ghicago, IL, USA, 1939.

36.

Blumensath

and Davies

M.E.

, Iterative thresholding for sparse approximations, Journal of Fourier Analysis and Applications 14 (2008), 629–654.

37.

, Iterative hard thresholding methods for l0 regularized convex cone programming, Mathematical Programming 147 (2014), 125–154.

38.

Jiang

de Lamare

R.C.

Zakharov

and He

, Knowledge-aided normalized iterative hard thresholding algorithms for sparse recovery, in: 26th European Signal Processing Conference, 2018, pp. 1965–1969.

39.

Dong

and Zhu

, Homotopy methods based on l0-norm for compressed sensing, IEEE Transactions on Neural Networks and Learning Systems 29 (2018), 1132–1146.

40.

X.G.

Jiang

H.M.

et al., Picasso: A sparse learning library for high dimensional data analysis in R and Python, Journal of Machine Learning Research 20 (2019), 1–5.

41.

Pomeroy

S.L.

Tamayo

Gaasenbeek

et al., Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415 (2002), 436–442.

42.

van’t Veer

Dai

van de Vijver

et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (2002), 530–536.

43.

Nutt

Mani

Betensky

R.A.

et al., Gene expression-based classification of malignant gliomas correlates better with survival than histological classification, Cancer Research 63 (2003), 1602.

44.

Bhattacharjee

Richards

W.G.

Staunton

et al., Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses, Proceeding of National Academy Science of United States America 98 (2001), 13790–13795.

45.

Alizadeh

A.A.

Eisen

M.B.

Davis

R.E.

et al., Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling, Nature 403 (2000), 503–511.

46.

Jeffrey

S.S.

de Rijn

M.V.

Waltham

et al., Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics 24 (2002), 227–235.

47.

Singh

Febbo

P.G.

Ross

et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell 1 (2002), 203–209.

48.

Khan

Wei

J.S.

Ringner

et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Proceeding of National Academy Science of United States America 7 (2001), 673–679.

49.

Brown

Pocock

Zhao

M.-J.

and Lujan

, A unifying framework for information theoretic feature selection, Journal of Machine Learning Research 13 (2012), 27–66.

Robust multi-class feature selection via l 2,0 -norm regularization minimization

Abstract

Keywords

1. Introduction

2. Related work

2.1 Notations and definitions

2.2 Multi-class feature selection based on l 2 , 1 -norm

3.1 Problem formulation

4. Experiment

4.1 Data sets description

2 http://www.escience.cn/system/file?fileId=82035.

4.3 Classification performance

4.4 Parameter sensitivity

4.5 Comparison of convergence speed and time consumption

4.6 Comparison of AHIHT and HIHT

5. Conclusions

Footnotes

Acknowledgments

References

2.2 Multi-class feature selection based on $l_{2,1}$ -norm

²
http://www.escience.cn/system/file?fileId=82035.