Feature selection is an important data preprocessing in data mining and machine learning, that can reduce the number of features without deteriorating model’s performance. Recently, sparse regression has received considerable attention in feature selection task due to its good performance. However, because the -norm regularization term is non-convex, this problem is hard to solve, and most of the existing methods relaxed it by -norm. Unlike the existing methods, this paper proposes a novel method to solve the -norm regularized least squares problem directly based on iterative hard thresholding, which can produce exact row-sparsity solution for weights matrix, and features can be selected more precisely. Furthermore, two homotopy strategies are derived to reduce the computational time of the optimization method, which are more practical for real-world applications. The proposed method is verified on eight biological datasets, experimental results show that our method can achieve higher classification accuracy with fewer number of selected features than the approximate convex counterparts and other state-of-the-art feature selection methods.
Feature selection, a process of selecting a subset of the raw features which are the most relevant and informative, has been widely investigated for many years [1, 2, 3, 4]. Due to the ability of reducing feature dimension, alleviating the effect of the curse of dimensionality, enhancing data understanding, speeding up the learning process and improving model’s performance, feature selection has become an essential component in data mining and machine learning, and has been widely applied in many practices, e.g., pattern recognition [5], text mining [6, 7], and bioinformatics [8, 9].
In General, feature selection methods can be categorized into three kinds depending on how they combine the feature selection search with model learning algorithms: filter methods, wrapper methods, and embedded methods. The filter methods use the intrinsic properties of data to select features before model learning, thus are independent of the learning algorithms. Typical filter methods include Relief [10], Chi-square and information gain [11], mRMR [12], etc. The wrapping methods treat the model learning algorithms as a black box to evaluate the selected features, such as support vector machine recursive feature elimination (SVM-RFE) [13] and correlation-based feature selection (CFS) [14]. Embedded methods incorporate the feature selection and model learning into a single optimization problem, such that higher computational efficiency and classification performance can be gained than filter methods and wrapped methods. Therefore, the embedded methods have gained large attention in recent years. Considering this, we study embedded methods in this paper.
Recently, with the development of sparsity researches, sparse regularization has been widely used in embedded feature selection methods. For binary classification task, the feature selection problem can be tackled by -norm minimization [15] directly, in which features corresponding to non-zero weights are selected. However, the non-convex and non-smooth of -norm make it very hard to solve. Most methods relax the -norm by -norm to make the minimization problem be convex and easy to solve, these methods were called LASSO [16]. Although some strategies such as one-to-all or one-to-one can be used to expand LASSO for multi-class feature selection task, structural sparsity models are more desirable so that the shared pattern of sparsity can be obtained. Inspired by that, many multi-class feature selection methods have been proposed based on structural sparsity model [17]. Nie et al. [18] firstly proposed a robust feature selection (RFS) method which combines a -norm loss function with a -norm regularization. The joint -norm minimization model has shown its superiority on feature selection, and is more robust to outliers. After that, the -norm regularization has been widely used in multi-class feature selection methods, e.g., UDFS [19], L-FS [20], RLSR [21], URAFS [22], etc.
Although satisfactory results can be achieved by using -norm regularization for multi-class feature selection, there are some limitations. First, -norm is only an approximation for -norm, thus the solution is different from the original optimal solution essentially. Second, Qian et al. [23] proved that -norm over-penalizes features with large weights, which will lead to an unfair competition between different features and hurt data approximation performance. Last, it is hard to tune the regularization factor of -norm to obtain exact row-sparsity solution, even a large regularization factor (e.g., ) cannot produce strong row-sparsity. Therefore, it is not clear that how many features need to be selected after the regularization factor tuned. Consequently, it is significant to solve the original -norm regularization problem.
In [24], Cai et al. proposed a robust and pragmatic multi-class feature selection (RPMFS) method based on -norm equality constrained optimization problem. RPMFS combines the -norm loss function with a -norm equality constraint, and uses the augmented Lagrangian method (ALM) to solve this equality constrained problem. In [25], Pang et al. also proposed an efficient sparse feature selection method (ESFS) based on -norm equality constrained problem. The equality constrained model is transformed into the same structure as linear discriminate analysis (LDA) so that the ratio of inter-class distance to intra-class distance of each feature can be calculated. However, the -norm equality constrained based methods exist two defects. First, the number of selected features needs to be pre-defined to construct the equality constraint. When tuning the number of selected features, the optimization program must be run repeatedly, which will cost a large amount of time and effort and is unpractical. Second, the overall performances of these methods are very sensitive to the number of selected features, in order to obtain satisfactory classification results, the number of selected features need to be tuned carefully. Literatures on sparsity research show that the regularized problem is more effective than the equality constraint problem to find a sparse solution.
Therefore, this paper proposes a novel and simple framework for multi-class feature selection that solves the original -norm regularization least squares problem (denoted as -FS) directly, in which a homotopy iterative hard thresholding method (HIHT) is introduced to perform the optimization. What’more, an accelerated version of HIHT (AHIHT) is derived to reduce the computational time. After learning, an exact row-sparsity solution is produced and the features can be selected in group. The contribution of this proposed -FS is that an efficient and effective optimization algorithm is designed to solve the -norm regularization least squares problem and produce row-sparsity solution for multi-class feature selection. After learning, the number of selected features can be fast tuned without rerunning the optimization program. To evaluate the performance of this proposed method, we use the selected features for classification, and compare the results with Baseline (without feature selection) and six supervised feature selection methods in terms of accuracy (Acc) and the number of selected features (No.fea) over eight biological data sets. The experimental results show that the performance of our approach is much better than the compared methods.
The remainder of this paper is organized as follows: Section 2 gives the notations and definitions used in this paper, and related works on -norm regularized problem and -norm equality constrained problem based multi-class feature selection methods are introduced. In Section 3, the proposed method -FS is presented and the optimization algorithms are introduced. Section 4 shows the experimental results, and conclusions and future work are given in Section 5.
Related work
Notations and definitions
This subsection summarizes the notations and definitions used in this paper. Matrices are presented as boldface uppercase letters, and vectors are presented as boldface lowercase letters. For a matrix , and denote its row and column, respectively. For a vector , denotes its element.
The -norm () of the vector is defined as
The -norm of the vector is defined as
which is a pseudo-norm and used to calculate the amount of non-zero elements of .
The Frobenius norm of a given matrix is defined as
and the -norm of is defined as
The -norm of matrix is defined as
where stands for the indicator function. For a scalar , if , , otherwise . Also, the -norm is a pseudo-norm, which is used to calculate the number of non-zero rows in . If a matrix has a large number of zero rows, we say it has the property of row-sparsity.
Multi-class feature selection based on -norm
In general, most multi-class feature selection algorithms based on -norm regularization can be formulated as follows:
where is the training data. is the binary label matrix with if belongs to class ; otherwise . denotes the model weights and denotes the learned biased vector. denotes the column vector with all elements being 1. denotes the sample size, is the number of classes, and is the feature dimension. is the regularized factor. After optimizing, the features of are selected according to the magnitude of .
This problem has been widely studied and a lot of variants have been proposed. Yuan et al. [26] proposed a Group LASSO as an extension of the LASSO by using the -norm for group feature selection. After the work of [26], the Group LASSO has been applied to many real life applications and a wide variety of formulations and solution approaches also have been presented [27, 28, 29]. Nie et al. [18] firstly combined the -norm regularization with a -norm loss function instead of the Frobenius norm loss function, and demonstrated that the proposed model is more robust to outliers than the original model. In [30], Masaeli et al. proposed the linear discriminant feature selection (LDFS) method which enforces row sparsity on the transformation matrix of LDA through -norm regularization. After that, Zhang et al. [20] combined the LDA model with -norm regularization to select more discriminative features, and Tao et al. [31] combined the LDA with a -norm regularization to obtain a better row-sparse solution. Recently, Yan et al. [32] imposed -norm and nonnegative constraints on the model weights. The nonnegative constrain ensures the row-sparsity of learned model weights combining with the -norm minimization, which makes it clearer to select which features. -norm regularization also has been widely used in unsupervised feature selection task. As in [19], Yang et al. proposed an unsupervised feature selection joint model which incorporates discriminative analysis into the -norm minimization. More related works can be seen in [22, 33]
Although the -norm based model can achieve satisfactory results, it is difficult to determine how many features need to be selected after the regularization factor tuned. From the sparsity perspective, -norm is more desirable.
Multi-class feature selection based on -norm constrained problem
The multi-class feature selection based on -norm always construct an equality constraint to determine the number of non-zero rows of weights matrix. In [24], Cai et al. construct the objective function as a -norm loss function with a -norm equality constraint, which can be formulated as
then the Augmented Lagrangian Multiplier (ALM) method is used to solve this problem.
In [25], the authors form a similar model that is written as follows:
where can be any reversible matrix which is used to code labels. The authors have transformed model Eq. (2.3) into the same structure as LDA by using this label coding method, which can calculate the ratio of inter-class distance to intra-class distance, so that discriminative features can be selected. Similarly, Won et al. [34] combined the -norm equality constraint with SVM model to select features in group for networked data.
Though the above mentioned feature selection methods are based on -norm, they also need to predefine the number of selected features to construct the equality constraint. How many features need be selected is unknown, so it will cost a large amount of time and effort to tune the number of selected features to get a satisfactory result. From literatures of sparsity research, it has been proven that the regularized problem is more effective than the equality constraint problem to find a sparse solution.
Robust feature selection based on -norm regularization
Problem formulation
In this work, we use the least squares regression combined with a -norm regularization for multi-class feature selection, the optimization function is formulated as
According to the Karush-Kuhn-Tucker (KKT) theorem [35], set the derivative of Eq. (4) with respect to b to be 0, then the optimal solution of b can be obtained as follows:
Replacing b with Eq. (5), we can rewrite Eq. (4) as
where is referred as the centering matrix, which has following properties
Now, the objective function of -FS is presented totally, next we will show how to optimize it. Iterative hard thresholding (IHT) algorithm [36, 37, 38] is usually used to solve -norm regularization minimization for the vector sparsity problem. Inspired by it, we extended IHT to solve the matrix sparsity Eq. (6).
IHT algorithms use the proximal gradient technique to update the solution iteratively. Firstly, we rewrite Eq. (8) as
where is a differentiable convex function, and its gradient satisfies the Lipschitz continuous condition (denote its Lipschitz constant as ). Therefore, can be approximately iteratively updated by the projected gradient method
where is a constant, which should satisfy the condition of .
Adding into both sides of Eq. (10), Eq. (8) can be solved by iteratively optimizing the subproblem
[b] Homotopy iterative hard thresholding method to solve Eq. (4)(Input:) Training data , training labels , centering matrix , parameters ; (Output:); 1: initialize ; 2: do3: ;4: ;5: ;6: doAn L-tuning iteration indexed by 7: update by Eq. (14);8: whiledo9: ;10: update by Eq. (14);11: end while12: ;13: ;14: until15: ;16: .17: ;18: ;19: until;20: .
By removing the item and adding an item , both of which are independent on W) and thus can be seen as constant items, the right hand of Eq. (11) can be rewritten as
Since the Frobenius norm and -norm are all separable function (both of these two functions can be represented as a sum of single-variable functions), each row of W can be updated individually
this subproblem has a closed-form solution as
in which the parameter needs to be tuned. The upper bound on Lipschitz constant is not be easily calculated, thus we use the line search method to tune as suggested in [37] until the objective value descent.
Homotopy Strategy: many works [39, 40] have verified that the sparse coding approaches benefit from a good starting point. Thus we can use the solution of Eq. (8), for a given value of , to initialize IHT in a nearby value of . Usually, the next loop with warm starting will require fewer iterations than the current loop. Using this warm-starting technique, we can efficiently solve Eq. (8) for a sequence of values of , which is called homotopy strategy. The complete routine of the HIHT algorithm for solving Eq. (8) is described in Algorithm 3.2.
Convergence analysis
For a fixed , since is Lipschitz continuous with constant , we have
Using this inequality, the fact that , and Eq. (11), we obtain that
is the optimal solution of Eq. (11) at iteration, thus the objective value of Eq. (11) with respect to is less than that with respect to . Replacing W by in the right side of Eq. (11), then the last inequality holds.
The above inequality implies that for a fixed , is non-increasing and moreover,
Since is bounded below, it then follows that is bounded below. Hence, converges to a finite value as and a local optimal solution can be achieved.
Since the is monotonically decreased, and is set as the initial solution for HIHT in , we obtain that:
it implies that the objective value is monotonically decreasing and a local optimal solution can be achieved by the proposed algorithm. We show an example of the objective value at each iteration in Fig. 1, it can be seen that the objective function values monotonically decrease at each iteration until convergence, which verifies the convergence of Algorithm 3.2 experimentally.
An example of objective value at each iteration.
Acceleration of HIHT
Inspired by [39], we derived an accelerated version for HIHT to reduce the computational time. For each fixed value of , Algorithm 3.2 repeats steps 6–14 to get a exact solution of Eq. (8). In the following, for acceleration of Algorithm 3.2, just one outer loop is called to replace steps 6–14. The complete routine of the the AHIHT method is described in Algorithm 3.4.
[t] Acceleration of HIHT(Input:) Training data , training labels , centering matrix , parameters ; (Output:); 1: initialize ; 2: doAn L-tuning iteration3: update by Eq. (14);4: whiledo5: ;6: update by Eq. (14);7: end while8: ;9: ;10: ;11: until12: .
Experiment
The proposed -FS method is evaluated by several experiments. The experiments are divided into three parts: (1) We evaluate the proposed method in terms of No.fea and Acc, and compare the results with Baseline and other six state-of-the-art multi-class feature selection methods. (2) We evaluate the performance sensitivity to the regularization factor . The influence of initialization of AHIHT for feature selection is also evaluated. (3) We compare the convergence speed and computational time of some sparsity-based methods.
Data sets description
Eight biological benchmark data sets are used to validate the performance of our approach in the experiments: Brain [41], Breast3 [42], Leukemia [43], Lung [44], Lymphoma [45], NCI [46], Prostate [47], and Srbct [48]. The 8 benchmark data sets were downloaded from Feiping Nie’s homepage,2
http://www.escience.cn/system/file?fileId=82035.
Table 1 gives a detail introduction to these data sets, it can be seen that these datasets are all high-dimensional with small sample sizes, it is time-consuming and may result in over-fitting if classify these datasets using raw features, therefore feature selection is important for these datasets.
Datasets desciption
Datasets
#samples
#features
#classes
Brain
42
5597
5
Breast3
95
4869
3
Leukemia
38
3051
2
Lung
203
3312
5
Lymphoma
62
4026
3
NCI
61
5244
8
Prostate
102
6033
2
Srbct
63
2308
4
The ACC value with corresponding No.fea using selected features (red is the best result and blue is the second one)
Dataset
Baseline
Relief
mRMR
RFS
RLSR
RPMFS
ESFS
-FS
Acc
No. fea
Acc
No. fea
Acc
No. fea
Acc
No. fea
Acc
No. fea
Acc
No. fea
Acc
No. fea
Acc
KNN
Brain
71.67
400
85.00
200
81.67
[rgb]0, 0, 160
[rgb] 0, 0, 187.50
380
80.00
340
82.50
200
81.67
[rgb]1, 0, 0100
[rgb]1, 0, 088.33
Breast3
50.65
160
54.58
[rgb]0, 0, 1380
[rgb]0, 0, 158.39
340
55.48
340
58.06
340
53.55
[rgb]0, 0, 1380
[rgb] 0, 0, 158.39
[rgb]1, 0, 0140
[rgb]1, 0, 061.29
Leukemia
96.67
400
97.50
240
99.17
[rgb]0, 0, 1260
[rgb]0, 0, 1100.00
400
97.50
400
99.17
200
99.17
[rgb]1, 0, 080
[rgb]1, 0, 0100.00
Lung
94.39
[rgb]0.00,0.00,1.00380
[rgb]0.00,0.00,1.0094.39
200
94.09
[rgb]1, 0, 080
[rgb]1, 0, 095.67
400
91.67
300
92.88
400
91.82
200
93.33
Lymphoma
97.50
340
99.50
[rgb]1, 0, 060
[rgb]1, 0, 0100.00
140
100.00
260
99.00
280
99.00
20
99.00
[rgb]0, 0, 140
[rgb]0, 0, 1100.00
NCI
73.89
340
73.33
[rgb]0, 0, 1240
[rgb]0, 0, 173.89
[rgb]0, 0, 1240
[rgb]0, 0, 173.89
220
72.78
380
67.78
320
73.33
[rgb]1, 0, 060
[rgb] 1, 0, 074.44
Prostate
81.82
[rgb]0, 0, 140
[rgb]0, 0, 191.82
60
90.30
[rgb]1, 0, 020
[rgb]1, 0, 093.03
20
89.09
20
86.06
20
87.88
[rgb]1, 0, 020
[rgb]1, 0, 093.94
Srbct
93.16
240
98.95
260
98.95
[rgb]1, 0, 020
[rgb]1, 0, 0100.00
60
96.84
380
93.68
400
100.00
[rgb]0, 0, 140
[rgb]0, 0, 1100.00
Softmax
Brain
82.50
400
88.33
400
88.33
[rgb]0, 0, 1160
[rgb]0, 0, 192.50
220
85.83
280
88.33
400
85.42
[rgb]1, 0, 0180
[rgb]1, 0, 093.33
Breast3
58.71
280
59.35
240
58.06
340
61.94
[rgb]0, 0, 1300
[rgb] 0, 0, 161.94
340
60.97
400
58.39
[rgb]1, 0, 0280
[rgb]1, 0, 062.90
Leukemia
99.17
320
98.33
60
99.17
120
99.17
[rgb]0, 0, 1220
[rgb]0, 0, 1100.00
240
100.00
400
99.17
[rgb]1, 0, 0120
[rgb]1, 0, 0100.00
Lung
[rgb]0.00,0.00,1.0095.76
280
94.70
400
95.30
[rgb]1, 0, 0180
[rgb]1, 0, 096.36
320
93.94
340
94.55
400
94.39
300
95.61
Lymphoma
94.00
220
95.00
20
95.50
[rgb]0, 0, 160
[rgb]0, 0, 198.50
60
96.00
320
97.50
[rgb]0, 0, 160
[rgb]0, 0, 198.50
[rgb]1, 0, 0280
[rgb]1, 0, 099.50
NCI
[rgb]0, 0, 176.11
400
75.00
140
72.22
140
73.33
360
71.67
400
72.22
260
73.33
[rgb]1, 0, 0220
[rgb]1, 0, 078.33
Prostate
90.61
120
92.73
260
92.12
[rgb]0, 0, 1140
[rgb] 0, 0, 195.45
400
93.94
280
91.52
260
93.33
[rgb]1, 0, 0100
[rgb]1, 0, 095.76
Srbct
99.47
240
98.42
20
97.89
[rgb]1, 0, 020
[rgb]1, 0, 0100.00
160
97.37
180
96.84
280
98.95
[rgb]0, 0, 180
[rgb]0, 0, 1100.00
Acc vs. No.fea using selected features by KNN. a. Breast3. b. Lung. c. NCI. d. Srbct.
Acc vs. No.fea using selected features by Softmax. a. Brain. b. Leukemia. c. Lymphoma. d. Prostate.
Parameter sensitivity demonstration on different data sets. a. Brain. b. Leukemia. c. Lung. d. Prostate.
Results of classification accuracy and the number of selected features with respect to different initialization. a. Breast3. b. Lung. c. Srbct.
The objective function value v.s. iteration number. a. Leukemia. b. Prostate.
The average computational time of each sparsity regularization method.
Classification accuracy and the number of selected features with respect to of AHIHT and HIHT. a. Breast3. b. Lung. c. NCI.
Experiment setup
In the experiments, we evaluate the feature selection performance on classification accuracy, where two popular classifiers KNN and Softmax are used. For each data set, of samples per class are randomly selected as training samples, and the rest samples are responsible for testing. Ten repeated trials are carried out and the average results are recorded for comparison. We compare our feature selection method with Baseline (without feature selection) and six state-of-the-art feature selection methods: (1) two basic filter methods: Relief [10], and mRMR [12]; (2) two -norm based methods: RFS [18], and RLSR [21]; (3) two -norm constrained methods: RPMFS [24], and EFSF [25]. The codes of Relief and mRMR are provided by the FEAST package [49], and others can be downloaded from the authors’ homepages.
For our method, the parameter is searched in the grid of , and for RFS and RLSR it is searched in . For all methods, the number of selected features is tuned from in this paper. The nearest neighbors of KNN is set as . All other parameters take the default values as suggested by the authors. For RLSR, all training data are used as labeled data, thus it can be seen as a supervised method here. The number of selected features with highest classification accuracy are recorded.
Classification performance
The classification performance of our approach is evaluated in this part. Table 2 shows the best Acc of each method with corresponding No.fea. From this table it can be seen that, our method can achieve highest Acc with fewest No.fea on most data sets, or comparable results to the best ones. On NCI dataset, the Acc of our approach obtained by Softmax is 78.33%, which is 3.33% higher than the second one (expect Baseline). RPMFS, ESFS and our algorithm both solve the original -norm problem, but RPMFS and ESFS try to solve an equality constraint problem so that the number of selected features needs to be tuned carefully to obtain a satisfactory classification result. The numerical comparison shows that our method outperforms RPMFS and ESFS most of the time, which indicates that our approach is more efficient than RPMFS and ESFS. The classification results demonstrate that our method can remove more redundant features while maintains the discriminative performance.
Figures 2 and 3 show Acc V.S. No.fea obtained by KNN and Softmax, respectively. It is obvious that the proposed method distinctly outperforms other approaches in the most experimental data sets. What’s more, compared with other lines, ours are not fluctuated too much as the No.fea changed, which indicates the performance of the proposed approach is much more robust to the number of selected features than other methods.
Parameter sensitivity
We evaluate the sensitivity of in this part, the Acc is employed to evaluate the performance of classification with searched in the grid of . The number of selected features varies in . The data sets of Brain, Leukemia, Lung and Prostate are used for testing, and the experimental results are shown in Fig. 4. It can be seen from this figure that, the performance of -FS is not very sensitive to . When is in the range of , there is almost no fluctuation on Acc. Thus the proposed approach is robust to and there is no need to spend much time to tune the value of . It also can be seen that the performance gets worst when , the reason is that a large value of regularization factor will hurt data approximation performance.
-norm is a non-convex problem, which may make the solution sensitive to the initialization. Three kinds of initialization are used to explore the effect on classification performance: zero-valued initialization, random Gaussian distributed initialization, and random uniform distributed initialization. For Gaussian and uniform distributed initialization, ten repeated trials are carried out and the average results are recorded. Figure 5 shows the results (in this figure, the No.fea is equal to the number of non-zero rows of W), in which data sets Breast3, Lung, and Srbct are used. It can be seen that these three different initialization methods can get the same results, indicating that AHIHT is not sensitive to initialization when applied to feature selection.
Comparison of convergence speed and time consumption
Figure 6 plots the objective function value for each iteration of the three -norm based methods. As can be observed, our method can decrease the objective value quickly in early iterations, which indicates that our method converges faster than RPMFS and ESFS. Figure 7 shows the average computational time of each sparsity regularization method on the eight data sets, which also compares AHIHT with HIHT. It can be seen that the computational time of RFS, RPMFS and AHIHT are much less than the other three methods, the reason is that the computational complexity of RFS, RPMFS and AHIHT is in propotion to while for other methods it is in propotion to or . Comparing AHIHT with HIHT, it can be found that AHIHT can reduce the computational time of HIHT effectively, thus is more suitable for practical applications.
Comparison of AHIHT and HIHT
In this part, we compare the performance of HIHT and its accelerated version in terms of accuracy and the row-sparsity of the produced solutions with different values of . Figure 8 shows the results, in which data sets Breast3, Lung, and NCI are used. From this figure, it can be seen that, when is less than 0.1, HIHT can get a more sparse solution than AHIHT, while AHIHT can get higher classification accuracy than HIHT, it means that the features selected by AHIHT are more useful for classification than those of HIHT. What’s more, the results obtained by AHIHT demonstrate AHIHT is more robust to than HIHT. For HIHT, it should tune a small value of to get satisfactory results, while a small value of will spend more computational time, which has been shown in Fig. 7. In conclusion, AHIHT is more practical than HIHT for feature selection.
Conclusions
This paper proposes an effective algorithm to solve the original -norm regularization least square problem for multi-class feature selection, in which a homotopy iterative hard thresholding is proposed to optimize the proposed model, thus the exact row-sparsity solution can be obtained. What’s more, an accelerated version of HIHT is derived to reduce the computational time, that is more practical for real-word applications. Experimental results on eight biological data sets have shown that the proposed approach can achieve better or comparable classification performance compared with other six state-of-the-art feature selection algorithms.
Footnotes
Acknowledgments
This work is supported by National Natural Science Foundation of China (NSFC) under grant #61873067.
References
1.
DashM. and LiuH., Feature selection for classification, Intelligent Data Analysis1 (1997), 131–156.
2.
LiZ.LiuJ.YangY.ZhouX. and LuH., Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering26 (2014), 2138–2150.
3.
ZhaoZ.HeX.CaiD. et al., Graph regularized feature selection with data reconstruction, IEEE Transactions on Knowledge and Data Engineering28 (2016), 689–700.
4.
QasimO.S.MahmoudM.S. and HasanF.M., Hybrid binary dragonfly optimization algorithm with statistical dependence for feature selection, International Journal of Mathematical Engineering and Management Sciences5 (2020), 1420–1428.
5.
SuchethaN.K.NikhilA. and HrudyaP., Comparing the wrapper feature selection evaluators on twitter sentiment classification, in: 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), 2019, pp. 1–6.
6.
TangB.KayS. and HeH., Toward optimal feature selection in naive bayes for text categorization, IEEE Transactions on Knowledge and Data Engineering28 (2016), 1602–1606.
7.
WanC.WangY.LiuY.JiJ. and FengG., Composite feature extraction and selection for text classification, IEEE Access7 (2019), 35208–35219.
8.
SaeysY.InzaI. and LarranagaP., A review of feature selection techniques in bioinformatics, Bioinformatics23 (2007), 2507–2517.
9.
LeiB.Y.ZhaoY.J. et al., Adaptive sparse learning using multi-template for neurodegenerative disease diagnosis, Medical Image Analysis61 (2020), 101632.
10.
KiraK. and RendellL.A., A practical approach to feature selection, in: Proceedings of the Ninth International Workshop on Machine Learning, 1992, pp. 249–256.
11.
RaileanuL.E. and StoffelK., Theoretical comparison between the gini index and information gain criteria, Annals of Mathematics and Artificial Intelligence41 (2004), 77–93.
12.
PengH.LongF. and DingC., Feature selection based on mutual information criteria of max-dependency, max-relevance, and minredundancy, IEEE transactions on Pattern Analtsis and Machine Intelligence27 (2005), 1226–1238.
13.
GuyonI.WestonJ.BarnhillS. and VapnikV., Gene selection for cancer classification using support vector machines, Machine Learning46 (2002), 389–422.
14.
HallM.A. and SmithL.A., Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, in: Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, 1999, pp. 235–239.
15.
WestonJ.ElisseeffA.ScholkopfB. and TippingM., Use of the zero-norm with linear models and kernel methods, Journal of Machine Learning Research3 (2003), 1439–1461.
16.
TibshiraniR., Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B: Methodological58 (1996), 267–288.
17.
GuiJ.SunZ.JiS.TaoD. and TanT., Feature selection based on structured sparsity: A comprehensive study, IEEE Transactions on Neural Networks and Learning Systems28 (2017), 1490–1507.
18.
NieF.HuangH.CaiX. and DingC.H., Efficient and robust feature selection via joint -norms minimization, in: Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
19.
YangY.ShenH.T.MaZ.HuangZ. and ZhouX., -norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.
20.
ZhangJ.YuJ.WangJ. and ZengZ.Q., L2,1-norm regularized fisher criterion for optimal feature selection, Neurocomputing166 (2015), 455–463.
21.
ChenX.NieF.YuanG. and HuangJ.Z., Semi-supervised feature selection via rescaled linear regression, in: Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 1525–1531.
22.
LiX.ZhangH.ZhangR.LiuY. and NieF., Generalized uncorrelated regression with adaptive graph for unsupervised feature selection, IEEE Transactions on Neural Networks and Learning Systems30 (2019), 1587–1595.
23.
QianM. and ZhaiC., Joint adaptive loss and -norm minimization for unsupervised feature selection, in: 2015 International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.
24.
CaiX.NieF. and HuangH., Exact top-k feature selection via -norm constraint, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013, pp. 1240–1246.
25.
PangT.NieF.HanJ. and LiX., Efficient feature selection via l2,0-norm constrained sparse regression, IEEE Transactions on Knowledge and Data Engineering31 (2019), 880–893.
26.
YuanM. and LinY., Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society. Series B: Statistical Methodology68 (2006), 49–67.
27.
JacobL.ObozinskiG. and VertJ.P., Group lasso with overlap and graph lasso, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 433–440.
28.
YuanL.LiuJ. and YeJ., Efficient methods for overlapping group lasso, Advances in Neural Information Processing Systems24 (2011), 352–360.
29.
RaoN.CoxC.NowakR. and RogersT.T., Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis, Advances in Neural Information Processing Systems26 (2013), 2202–2210.
30.
MasaeliM.DyJ.G. and FungG.M., From transformation-based dimensionality reduction to feature selection, in: Proceedings of the Twenty-Seven International Conference on Machine Learning, 2010, pp. 751–758.
31.
TaoH.HouC.NieF.JiaoY. and YiD., Effective discriminative feature selection with nontrivial solution, IEEE Transactions on Neural Networks and Learning Systems27 (2016), 796–808.
32.
YanH. and YangJ., Robust joint feature weights learning framework, IEEE Transactions on Knowledge and Data Engineering28 (2016), 1327–1339.
33.
NieF.ZhuW. and LiX., Structured graph optimization for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering33 (2021), 1210–1222.
34.
WonD.ManzourH. and ChaovalitwongseW., Convex optimization for group feature selection in networked data, INFORMS Journal on Computing31 (2020), 182–198.
35.
KarushW., Minima of functions of several variables with inequalities as side constraints, in: M.S. thesis, Dept. Math., Univ. Chicago, Ghicago, IL, USA, 1939.
36.
BlumensathT. and DaviesM.E., Iterative thresholding for sparse approximations, Journal of Fourier Analysis and Applications14 (2008), 629–654.
37.
LuZ., Iterative hard thresholding methods for l0 regularized convex cone programming, Mathematical Programming147 (2014), 125–154.
38.
JiangQ.de LamareR.C.ZakharovY.LiS. and HeX., Knowledge-aided normalized iterative hard thresholding algorithms for sparse recovery, in: 26th European Signal Processing Conference, 2018, pp. 1965–1969.
39.
DongZ. and ZhuW., Homotopy methods based on l0-norm for compressed sensing, IEEE Transactions on Neural Networks and Learning Systems29 (2018), 1132–1146.
40.
GeJ.LiX.G.JiangH.M. et al., Picasso: A sparse learning library for high dimensional data analysis in R and Python, Journal of Machine Learning Research20 (2019), 1–5.
41.
PomeroyS.L.TamayoP.GaasenbeekM. et al., Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature415 (2002), 436–442.
42.
van’t VeerL.DaiH.van de VijverM. et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature415 (2002), 530–536.
43.
NuttC.ManiD.BetenskyR.A. et al., Gene expression-based classification of malignant gliomas correlates better with survival than histological classification, Cancer Research63 (2003), 1602.
44.
BhattacharjeeA.RichardsW.G.StauntonJ. et al., Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses, Proceeding of National Academy Science of United States America98 (2001), 13790–13795.
45.
AlizadehA.A.EisenM.B.DavisR.E. et al., Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling, Nature403 (2000), 503–511.
46.
JeffreyS.S.de RijnM.V.WalthamM. et al., Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics24 (2002), 227–235.
47.
SinghD.FebboP.G.RossK. et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell1 (2002), 203–209.
48.
KhanJ.WeiJ.S.RingnerM. et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Proceeding of National Academy Science of United States America7 (2001), 673–679.
49.
BrownG.PocockA.ZhaoM.-J. and LujanM., A unifying framework for information theoretic feature selection, Journal of Machine Learning Research13 (2012), 27–66.