Abstract
Abstract
Sparse linear models approximate target variable(s) by a sparse linear combination of input variables. Since they are simple, fast, and able to select features, they are widely used in classification and regression. Essentially they are shallow feed-forward neural networks that have three limitations: (1) incompatibility to model nonlinearity of features, (2) inability to learn high-level features, and (3) unnatural extensions to select features in a multiclass case. Deep neural networks are models structured by multiple hidden layers with nonlinear activation functions. Compared with linear models, they have two distinctive strengths: the capability to (1) model complex systems with nonlinear structures and (2) learn high-level representation of features. Deep learning has been applied in many large and complex systems where deep models significantly outperform shallow ones. However, feature selection at the input level, which is very helpful to understand the nature of a complex system, is still not well studied. In genome research, the cis-regulatory elements in noncoding DNA sequences play a key role in the expression of genes. Since the activity of regulatory elements involves highly interactive factors, a deep tool is strongly needed to discover informative features. In order to address the above limitations of shallow and deep models for selecting features of a complex system, we propose a deep feature selection (DFS) model that (1) takes advantages of deep structures to model nonlinearity and (2) conveniently selects a subset of features right at the input level for multiclass data. Simulation experiments convince us that this model is able to correctly identify both linear and nonlinear features. We applied this model to the identification of active enhancers and promoters by integrating multiple sources of genomic information. Results show that our model outperforms elastic net in terms of size of discriminative feature subset and classification accuracy.
1. Introduction
S
LASSO is a special case of elastic net by setting α = 1.
The popularity of sparse linear models is due to the following reasons. First, their concept is easy to understand. Second, variables can be selected. Third, the learning of model parameter

A structural comparison of our deep feature selection (DFS) models and previous ones.
Through piling up hidden layers, deep neural networks are able to model nonlinearity of features. Figure 1C is an example of such a deep model called multilayer perceptrons (MLP), which is a deep feed-forward neural network. The techniques of learning deep models and their inferences fall into an active research frontier—deep learning (Hinton et al., 2006), which has four attractive strengths for applying it in complex intelligent systems: First, deep models often dramatically increase prediction accuracy. Second, they can model processes of complex systems. Third, they can generate structured high-level representations of features that can help the interpretation of data. Fourth, (convolutional) deep learning models are robust to temporal or spatial variation. But the learning of such models are usually nonconvex in optimization, and the back-propagation algorithm (a first-order method) does not perform well on deeper structures. The optimization strategy using greedy layer-wise unsupervised pretraining and finetuning, proposed in Hinton et al. (2006), is considered as a breakthrough. While high-level feature extraction and representation have been intensively studied in the surge of deep learning research (Bengio et al., 2013), feature selection at the input level is still not well studied. However, in bioinformatics and other studies on complex systems, selecting key input features are crucial to understanding the mechanisms of the systems. Thus, existing models mentioned above do not meet this need. In our current bioinformatics research, we are committed to devising a deep learning model for the identification and understanding of cis-regulatory elements in the human genome.
Genome researchers have discovered that noncoding DNA sequences (previously viewed as junk DNA) are composed of many regulatory elements (The ENCODE Project Consortium, 2012). These elements (including enhancers and promoters) precisely control the expression level of genes. Promoters are cis-acting DNA sequences that switch on or off the expression of genes, while enhancers are generally cis-acting DNA sequences that tune the expression level of genes (Shlyueva et al., 2014). A promoter resides close to its target gene, while an enhancer is distal to its target gene(s), making it difficult to identify. The identification of active enhancers and promoters in a genome is of key importance, as it can help to elucidate the regulatory mechanism in the genome and interpret disease-causing variants within cis-regulatory elements. However, since regulatory landscapes of DNA are quite different among cell types, and regulatory events are precisely and dynamically controlled by multiple factors, including epigenetic marks, transcription factors, microRNAs, and their interactions, it is a difficult task to identify active enhancers and promoters in a given cell type. The emergence of both deep sequencing and deep computing techniques casts light on this problem.
In order to select key input features for identifying and understanding regulatory events, we propose a deep feature selection model that enables variable selection for deep neural networks. In this model, a sparse one-to-one layer, where each input feature is weighted, is added between the input and the first hidden layer, giving two advantages: (1) a single subset of features for multiple classes (multiple output nodes) can be conveniently selected, which addresses the challenge of multiclass extension of linear models; and (2) through selecting features at the input level of the deep structure, we are able to identify informative features that have nonlinear behaviors.
2. Method
2.1. Deep feature selection
We focus our research on feature selection for multiclass data using deep neural networks. We propose a deep feature selection (DFS) model that can select features at the input level of a deep network. An example of such a model is illustrated in Figure 1D. Our main idea is to add a sparse one-to-one linear layer between the input layer and the first hidden layer of an MLP. In this one-to-one layer, the input feature xi only connects to the i-th node with linear activation function. Thus, the output of the one-to-one layer becomes
where
2.2. Learning model parameter
Suppose there are K hidden layers in a DFS model. Its model parameter can be denoted by
which is explained as follows.
1. l(
where
2. Regularization term
3. Regularization term
In the neural network community, it is well-known that Equation (3) is non-convex, and the gradient descent method (back-propagation) converges only to a local minimum of the weight space. Practically, it performs fairly well with a small number of hidden layers. However, as the number of hidden layers increases, this algorithm would deteriorate, because gradient information disperses or explodes in lower layers. So for a small number of hidden layers, we directly use a back-propagation algorithm to train our DFS model. For a large value of K, if back-propagation does not perform well, we resort to stacked contractive autoencoder (ScA) or deep belief network (DBN). The ScA and DBN-based DFS models are pretrained in a greedy layer-wise way, and then finetuned by back-propagation.
Although the objective f (
2.3. Shallow DFS is not equivalent to LASSO
Is the result of a shallow DFS model (Fig. 1B) equivalent to that of LASSO (Fig. 1A)? If so, there is no need to build the DFS model except for practical reasons; features could be simply selected by making
We have the following proposition:
then Equation (7) would be equivalent to Equation (6). However, we cannot. This is because
2.4. Alternative: l2,1-norm based deep feature selection
Inspired by works of non-negative matrix factorization (NMF) (Nie et al., 2010; Kong et al., 2011), we learn that the row-wise sparseness of a matrix can be alternatively achieved using l2,1-norm. The l2,1-norm of matrix
Thus, an alternative to our shallow DFS is applying l2,1-norm to the weight matrix of the structure depicted in Figure 1A, that is
The multiclass group LASSO (Sanders, 2009) and multiclass l2,1-norm support vector machine (Cai et al., 2011), which defines each row vector of the coefficient matrix as a group, is essentially an l2,1-norm regularized linear model. The l2,1-norm-based deep DFS model employs l2,1-norm to the coefficient matrix of the first hidden layer of Figure 1C. Thus, the corresponding objective function to be minimized can be written as
3. Comparing Models on Synthetic Data
Since it is often difficult to validate whether features highlighted by a feature selection method are informative in the real world, we generated two simulated data sets, using nonlinear rules and quadratic polynomial, respectively. The first data set, named Simulation1, has nine features and three classes. Distributions of these features and their relationships with the class labels are shown in Figure 2. Among these features, X0 follows a uniform distribution within the interval of [−1, 1]. We let X0 = 0 linearly separate class 0 from class 1 and class 2. Feature X1 negatively correlates with X0. X2 and X3 are nonlinear features with exclusive-OR (xor) relationships with the classes 1 and 2. X4, X5, X6, X7, and X8 are redundant features, following normal, Laplace, and uniform distributions, which are independent of class labels. We generated 3000 samples for each class, thus 9000 in total. The second data set is named Simulation2, which has seven features and four classes. We let X0 ∼ Γ(6, 1), X1 ∼ Γ(8, 1), X2 ∼ Γ(3, 10), X3 ∼ Γ(10, 2), X4 ∼ Pois(3), X5 ∼ U(0, 1), and X6 ∼ Γ(0.3, 1). We generated a continuous response by

Distributions of features and their relationships to class labels on Simulation1.
Each simulated data set was equally split to training, validation, and test sets in order to evaluate deep DFS, shallow DFS, LASSO (Tibshirani, 1996), and random forest (Breiman, 2001) for feature selection. The DFS models learned on a training set and used validation error to avoid overfitting. LASSO and random forest were learned on a training set. The accuracy of predicting the labels of test samples were used to measure the performance of a model. In the deep DFS model, we used three hidden layers ({36, 18, 9} neurons for Simulation1, and {28, 14, 7} neurons for Simulation2) with ReLU as activation functions, respectively. In both deep DFS and shallow DFS, we set the initial learning rate to 0.1, λ2 = 1, α1 = 0.0001, and α2 = 0. We used values of λ1 ranging from 0.05 to 0. We used the glmnet package (Friedman et al., 2010) for LASSO, and the randomForest package (Liaw and Wiener, 2002) for random forest.
We recorded weights of the features and corresponding test accuracies in Table 1. The results on Simulation1 is shown at the top of Table 1. We found that deep DFS are able to discover the linear features, X0 and X1, and the nonlinear features, X2 and X3. Using the linear features only, deep DFS obtained an accuracy of 0.6723. Adding the nonlinear features, deep DFS achieved an accuracy of 0.9907. From the experiment, we can see that the shallow DFS can only recover the linear features, X0 and X1. Thus, this model can only correctly predict the class labels of two thirds of the test samples. Consistent with common knowledge, LASSO tended to select only one of the correlated linear features. Its classification performance is similar to shallow DFS. From our results, we also found that the DFS models can select correlated features (X0 and X1) simultaneously, while LASSO only utilizes one of them. The last row of Table 1 shows the importance of features computed by random forest. We can see that random forest has the capability of identifying both linear and nonlinear features. The accuracies between deep DFS and random forest models were comparable.
Large nonzero numbers in a sparse vector are highlighted in boldface.
The bottom of Table 1 shows the results on Simulation2. Deep DFS was able to identify all informative features and obtained the highest accuracy (0.9780). Shallow DFS only obtained an accuracy of 0.7902 using four features. It correctly selected features used in squared terms, but failed to recover features interacting with others. LASSO could select all informative features but could not fully model the nonlinearity. It obtained an intermediate result (0.8903). Random forest had a similar accuracy to LASSO. It tended to give a much higher score to X0, but underrepresented X1, X2, and X3.
Overall, we conclude that deep DFS can identify informative features involved in nonlinear rules and polynomial terms. It highlights important features with stronger distinction; that is, the difference in feature weights between the informative features and the noninformative features was greater for deep DFS than random forest. Shallow DFS can only identify features related to linear rules and noninteractive polynomial terms, thus has low performance. LASSO fails to recover features involved in nonlinear rules. The importance scores of random forest reflect the usefulness of features, but are less distinctive than DFS. Random forest has a comparable accuracy on the data generated by nonlinear rules, but gets lower performance on the data generated using quadratic polynomial.
4. Applying DFS to Enhancer–Promoter Classification
We applied the DFS model in the challenging problem of enhancer–promoter classification. In order to assess the performance of this model, we compared four models, including our deep DFS model having two hidden layers (Fig. 1D), our shallow DFS model having no hidden layer (Fig. 1B), elastic-net-based softmax regression (Fig. 1A), and random forest (Breiman, 2001). We shall first describe the genomic data we used. Then, we compare the prediction accuracy. Finally, we provide new insights into the features selected.
4.1. Data
We compared the models on our processed data sampled from annotated DNA regions of GM12878 cell line (a lymphoblastoid cell line). This data set has 93 features and three classes, each of which contains 2,156 samples. Based on the FANTOM5 promoter and enhancer atlases (The FANTOM Consortium et al., 2014; Andersson et al., 2014), each sample comes from one of the three classes of annotated DNA regions including active enhancer regions, active promoter regions, and background. The background class is a pool of inactive enhancers, inactive promoters, active exons, and unknown regions. The features include cell-ubiquitous characteristics such as CpG-islands and evolutionary conservation Phastcons score, and cell-specific events including DNA-accessibility, histone modifications, and transcription factor binding sites captured by the ENCODE consortium using ChIP-seq techniques (The ENCODE Project Consortium, 2012). For a fair comparison, we split our data set equally into training set, validation set, and test set. All models were trained on the same training set. The validation accuracy is used to monitor the training of the DFS models to avoid overfitting. The same test set was blinded in the training of all three models, so the test accuracy was used to examine the quality of feature subsets.
4.2. Comparing test accuracy
In our deep DFS model, we take the structure of {93 → 93 → 128 → 64 → 3} by a manual rough model selection, due to a concern about the efficiency of automatic model selection for deep models. We used tangent activation functions. We set the minbatch size to 100, the maximum number of epochs to 1000, the initial learning rate s = 0.1, and the coefficient of momentum α = 0.1, λ2 = 1, α1 = 0.0001, and α2 = 0. We conducted feature selection for values of λ1 from the range of [0, 0.03] by a step of 0.0002. Our shallow DFS model has a structure of {93 → 93 → 3}. For this model, we tried values of λ1 from [0, 0.07] by a step of 0.0005. The rest of the user-specified parameters were kept the same as the deep DFS above. Elastic-net-based softmax regression simply has a structure of {93 → 3}. We tried different values of α. We used the glmnet package for it, thus the full regularization path with a fixed value of α was produced by a cyclic coordinate descent algorithm (Friedman et al., 2010). For random forest, we applied the randomForest package in R (Liaw and Wiener, 2002).
The test accuracies versus the sizes of feature subsets are illustrated in Fig. 3A. From a feature selection context, we focus the comparison on the critical region as highlighted by a rectangle in this plot. In this region, the paired Wilcoxon signed-rank test was conducted to check whether a classifier significantly outperforms another one (see Fig. 3B). In addition to accuracy, the confusion matrices of different models, when selecting 16 features, are given in Figure 3C. First of all, with the comparison between our shallow DFS model and elastic net, it can be seen that our shallow model has a significantly higher test accuracy than elastic net for the same number of selected features. From a computational viewpoint, it hence corroborates that adding a sparse one-to-one layer is a better technique than the tradition of simply combining the feature subsets selected for each class. Second, from the comparison of our deep and shallow DFS models, it shows that a significantly better test accuracy can be obtained by our deep model. It hence supports that considering the nonlinearity among the variables can contribute to the improvement of prediction capability. Third, it is interesting to see that random forest with certain top-ranked features performs better than the deep learning model. This may be because the structure and parameter of the deep model was not optimized. Finally, from the confusion matrices as shown in Figure 3C, we highlight that some active promoters tend to be classified as active enhancers.

Comparison of classification performance on enhancer and promoter data.
4.3. Feature analysis
We analyzed the features selected by the DFS models, LASSO, and random forest. Regularization path (Tibshirani, 1996; Efron et al., 2004) is a well-known method to illustrate the importance of selected features through plotting coefficients in curves. However, it becomes less informative as the number of features increases. Thus, as shown in Figure 4, we instead use a heatmap representation of feature coefficients that is equivalent to the regularization path. Since LASSO is supposed to have three (each for a class) heatmaps, we combined them by taking the corresponding maximal absolute values of its coefficients. That is, for a value of λ, we convert matrix

Coefficient heatmaps (equivalent to the conventional regularization paths) of the DFS and LASSO models, and the importance of features ranked by random forest.
First, we can see that the heatmaps of our shallow and deep DFS models are much sparser than that of LASSO. This implies that our scheme using a sparse one-to-one weighting layer is able to select a small subset of features along the regularization path, while LASSO tends to select more features, because it fuses all class-specific feature subsets. Second, comparing the result of the shallow DFS and LASSO, we can see many differences. For example, LASSO emphasizes CpG islands, TBLP1, and TBP, while they are not selected by the shallow DFS until later in the process. Instead, the heatmap of the shallow DFS indicates that ELF1, H2K27ac, Pol2, RUNX3, etc., are important features.
From GeneCards (Rebhan et al., 1997) and literature, we surveyed the known functionality of features selected by the deep and shallow DFS in an early phase. The functionality and specificity of these features are given in Table 2, where the last column is our conclusion about the binding specificity of the features based on the box-plots (see Fig. 5) of our data. First, the table shows that deep DFS identifies more key features earlier than shallow DFS, such as BCL11A, HeK27me3, H3K4me1, H4K20me1, NRSF, TAF1, and TBP. Interestingly, the deep DFS found a nonlinear relation: TAF1 and TBP are actually components of TFIID functioning as RNA polymerase II locator. Second, we can see that the known functionality of the majority of selected features, as highlighted in bold in Table 2 (i.e., ELF1, H3K27ac, Pol2, BATF, EBF1, H3K36me3, H3K4me2, NFYB, RUNX3, BCL11A, H3K27me3, H3K4me1, and NRSF), are consistent with the binding specificity drawn from our data. From the box-plots of our data (Fig. 5), we are also able to identify novel enrichment of some features (emphasized by italic type in Table 2) in enhancers and inactive elements. For example, H3K9ac is thought to be enriched in actively transcribed promoters (Kratz et al., 2010); our results show that it is also enriched in active enhancers. H4K20me1 is reported to be enriched in exons (Vakoc et al., 2006); our results also show that both inactive enhancers and inactive promoters are enriched with H4K20me1. TAF1 and TBP is known as a promoter binder; our results show that they are also associated with active enhancers. Finally, it has to be mentioned that some cell-specific features can be identified by the DFS models. From Table 2, we can see that ELF1 (Bredemeier-Ernst et al., 1997), BATF (Ise et al., 2011), EBF1 (Nechanitzky et al., 2013), and BCL11A (Lee et al., 2013) are specific to lymphoid cells (recall that GM12878 is a lymphoblastoid cell line from blood). This further confirms that the selected features are highly informative.

Box plots of features selected by DFS in different classes. In addition, the last two features (CpGIslands and TBLR1) are the features selected by LASSO. The y axis is the fold change (experiment vs. input) of ChIP-seq signal. From the box plots, feature enrichment in different classes can be observed.
The last column is the binding specificity of these features based on box-plots of our data (Fig. 5). Features consistent between known functions and binding specificity are highlighted in boldface. Features having novel enrichment are emphasized in italic type. A, active; I, inactive; P, promoter; E, enhancer; Ex, exon.
Random forest is suitable for multiclass data. It can return the importance of each feature by measuring the decrease of out-of-bag error by permuting the values of this feature (Breiman, 2001). We compared the features selected by our models with the ones ranked by random forest, as shown in Figure 4D. The majority of informative features selected by the DFS models are top-ranked in random forest, except that NFKB and ELF1 are scored as less important. It may be because our DFS model considers the dependency of the features, while random forest independently measures the impact of removing each feature from the model.
5. Computing Time
In Table 3, we recorded the training times of the four models tested on the synthetic and real data sets. Random forest and shallow DFS ran the fastest, taking a few seconds (even less than one second) to learn. Elastic net spent a few more seconds in training. The deep DFS model “unsurprisingly” consumed more time to finish the training procedure.
6. Conclusion and Future Works
Linear methods do not model the nonlinearity of variables and can not be extended to multiclass in a natural way for feature selection, while deep models learn nonlinearity and high-level representation of features. In this article, we propose a deep feature selection model for selecting input features in a deep structure, especially for multiclass data. We applied this model to distinguish active promoters and enhancers from the rest of the genome. Our results show that our shallow and deep DFS models are able to select a smaller subset of features than LASSO with comparable accuracy. Furthermore, our deep DFS can select discriminative features that may be overlooked by the shallow DFS. Through looking into the genomic features selected, we find that the features selected by DFS are biologically plausible. Furthermore, some selected features have novel enrichment in regulatory elements. We also evaluated the new model on simulated data in order to understand its behavior. Our results confirm that the deep DFS is able to recover both linear and nonlinear features. We implemented the DFS model in Python based on the Theano and Deep Learning Tutorials. This model together with our modification of the Deep Learning Tutorials led to a convenient deep learning package accessible at Li and Wyeth (2015).
More investigations should be conducted for improving the stability and performance of DFS. A few are enumerated below. First, we notice that the sign of weights in the one-to-one layer is not informative, because the sign can effectively be reversed by later layers. We thus plan to impose a non-negativity constraint on the weights of the one-to-one layer. Second, an observed limitation is the presence of negligible weights in the feature selection layer. Other sparse regularization techniques, for example, spike-and-slab (Goodfellow et al., 2012), can be tested to achieve a better quality of sparseness. Third, we note that DFS can be influenced by the purity of each minibatch. More research needs to be done to unveil the reason. Fourth, it is necessary to study whether there exists an efficient algorithm for computing the regularization path of the DFS model. Fifth, we will build a deep feature selection model to select variables for regression problems. We remain interested in devising feature selection methods for other deep learning models.
Footnotes
Acknowledgments
We thank the anonymous reviewers for constructive comments to improve the quality of this article. Dr. Anthony Mathelier (UBC) and Wenqiang Shi (UBC) provided valuable suggestions. Dr. Anshul Kundaje (Stanford) provided valuable instruction during the processing of ChIP-seq data from ENCODE. This research is supported by the Genome Canada Applied Bioinformatics of Cis-regulation for Disease Exploration project (ABC4DE, to
), the Natural Sciences and Engineering Research Council of Canada (NSERC) Postdoctoral Fellowship (PDF-471767-2015, to YL), and the UBC Four Year Doctoral Fellowship (to CYC).
Author Disclosure Statement
The authors declare that no competing financial interests exist.
