Feature selection using autoencoders with Bayesian methods for high-dimensional data

Abstract

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional and limited in quantity. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features that are useful for machine learning tasks, and models can then be trained effectively on these features. Selecting features from high-dimensional data nevertheless remains a challenge. To address this issue, this paper proposes a hybrid feature selection approach consisting of an autoencoder and Bayesian methods. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer, with the aim of increasing precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is compared with mainstream feature selection approaches and outperforms them. We find that combining autoencoders with probabilistic correction methods is more effective for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when a smaller number of features is selected. The proposed method provides a theoretical reference for analyzing the optimality of feature selection.

Keywords

Autoencoder, Bayesian method, feature selection, high-dimensional data

1 Introduction

Feature selection, also called feature subset selection, selects N features from the existing M features so as to optimize a specific index of the system. It can also be viewed as the process of selecting some of the most effective features from the original features in order to reduce the dimension of the data set [1, 2]. Feature selection removes redundant and irrelevant features from data and thereby lessens the negative impact of the curse of dimensionality [3]. In classification tasks in particular, feature selection increases classification precision, so that the information carried by key features can be better understood [4]. The useful features retained by feature selection are also used in subsequent learning tasks [5, 6]. As a result, feature selection is of great importance to machine learning.

In many applications, data sets are rich in features. Feature selection, which aims to select the most discriminative or informative features from data sets, has been a hot topic in recent years [7–10]. There are two major challenges in feature selection. One is that representing the data distribution in a high-dimensional space requires a sufficient amount of information [11], while data in high-dimensional spaces is distributed too sparsely to provide rich information. The other is that high dimensionality produces an exponential search space (i.e., the curse of dimensionality), which has a negative influence on feature selection tasks [12–14]. The situation becomes worse when irrelevant features, e.g., noise, produce interference [15]. Overall, feature selection in high-dimensional data is challenging.

For machine learning tasks, feature selection affects a model's ability to generalize, so the choice of feature selection method is very important. The existing methods are usually divided into filter methods, e.g., [16], wrapper methods, e.g., [17], and embedded methods, e.g., [18–20]. Filter methods select features before the learner is trained. Using evaluation functions, they reduce the correlation among features while increasing the correlation between categories and features [21]. In practice, it is difficult to select or construct an objective evaluation function. For wrapper methods, the selected features rely heavily on the capability of the underlying model, e.g., a decision tree algorithm [22] or a support vector machine [23]. Although wrapper methods can select more useful features than filter methods, their evaluation mechanism is quite time-consuming because all possible feature subsets need to be evaluated, especially when the data is very high-dimensional. Embedded methods consider all features as a whole, together with learning performance, and integrate feature selection and training into a single process. These conventional methods assign a common discriminative feature set to the whole sample space without considering the local behavior of data in different regions of the feature space [24, 25].

Recently, deep approaches have been successfully used for feature selection. For instance, neural networks with controlled redundancy are proposed in [26], and similar approaches appear in [27–29]. For deep neural networks, Scardapane et al. [30] proposed a very general sparse regularization strategy that selects the input variables and hidden nodes simultaneously.

Autoencoders are neural networks that learn an encoding function, which maps the input space to a representation space, and a decoding function, which reconstructs the original input from the representation space [31]. Autoencoders not only reduce data dimensionality, but also provide an effective way to discover previously unknown, meaningful insights for classification [15]. Moreover, autoencoders are substantially more capable of extracting semantically rich features and non-linear feature relations, and the representations they learn are qualitatively easier to interpret [32]. Compared with the mainstream methods, e.g., [16–23], feature selection approaches using autoencoders offer more powerful dimensionality reduction [33]. For instance, Hinton et al. [34] design a deep regularized autoencoder whose error is lower than that of principal component analysis in dimensionality reduction. Theis et al. [35] compress images using compressive autoencoders. In addition, several types of regularized autoencoders have been used for feature selection [36]. For example, in [37], interpretable and discriminative features are learned using a simple sparse autoencoder; similar approaches appear in [38–41].

Bayesian methods can estimate unknown states with subjective probabilities under incomplete information, so they play a prominent role in solving the variable selection problem. For instance, in [42] and [43], a Bayesian classifier is used for feature selection in text. Abbas et al. [44] perform feature selection by combining the Particle Swarm Optimization algorithm with Bayesian methods. Bayesian methods not only show an excellent ability to select features from text, but also perform well on feature selection for high-dimensional data. Zhao et al. [45] propose a novel Bayesian regression framework that successfully selects features from ultra-high-dimensional data. Bayesian methods exhibit these capabilities in feature selection because they support accurate perception of informative data [46]. This perception ability is interpreted mathematically in detail in [47–50]. Moreover, Bayesian methods are less sensitive to missing data.

Given the complementary advantages of autoencoders and Bayesian methods, it is attractive to study a hybrid of the two for feature selection. In this work, our primary goal is to select non-redundant features from high-dimensional data, and our broader goal is to explore the ability of deep architectures to perform feature selection. We propose a hybrid model consisting of an autoencoder and the Bayesian method (AEBd), which is built in the following steps: 1) An autoencoder with three hidden layers is designed. The first and second hidden layers are used for non-redundant feature selection. 2) The third hidden layer, named the ‘Bayesian layer’, is a special layer. Because Bayesian methods consider the occurrence probability of various events and the loss caused by misjudgments, the Bayesian layer is suitable for controlling feature misjudgments during selection. Finally, the proposed method is tested and validated comprehensively on a handwritten digit recognition task.

We summarize the main contributions of this work as follows:

(i) A hybrid model consisting of an autoencoder and Bayesian methods is proposed for feature selection in high-dimensional data. The proposed model improves the precision of feature selection while reducing the probability of feature misjudgments during selection.

(ii) For feature selection, autoencoders combined with probabilistic correction methods are more valuable than stacked architectures or adding constraints to autoencoders.

(iii) Stacked autoencoders have more advantages when a larger number of features is selected, while sparse autoencoders are beneficial when a smaller number of features is selected.

2 Preliminary

In Section 2.1, we review Bayesian methods, which are helpful for optimizing our model. In Section 2.2, autoencoders are described. This background provides theoretical support for our model.

2.1 Bayesian method

Let us assume that a sample x belongs to one of C categories w₁,...,w_m (so that m = C). The Bayesian method then decides the category of the sample according to

$\max \{ p (w_{1} | x), \ldots, p (w_{i} | x), \ldots, p (w_{m} | x) \}, \quad i = 1, \ldots, m$ (1)

where p(w_i|x) is the posterior probability of the i-th category. Equation (1) shows that the Bayesian method chooses the category with the largest p(w_i|x).

We want to minimize the classification error rate, that is, $\min P (e) = \int P (e | x) p (x) dx = E [P (e | x)]$ (2)

In Equation (2), P(e|x) ≥ 0 and p(x) ≥ 0 hold for all x, so minimizing P(e) amounts to minimizing P(e|x) for every x. According to the Bayesian method, the minimum-error-rate decision is the maximum-posterior-probability decision. The posterior probability of each class can be regarded as the discriminant function of that class, and the decision process compares all discriminant functions and chooses the largest one. For a C-class problem, the average error rate is composed of C(C-1) terms. Because computing the error rate directly is expensive, it is obtained from the average correct rate P(C) instead: $P (e) = 1 - P (C) = 1 - \sum_{j = 1}^{m} P (w_{j}) \int_{R_{j}} p (x | w_{j}) dx$ (3)

where R_j denotes the region of the input space in which the decision is w_j.

In making a decision, we are concerned not only with its correctness but also with the loss caused by an inappropriate decision: $\lambda (\alpha_{i}, w_{j}) = \lambda (g (x) = w_{i} | w_{j}), \quad i, j = 1, \ldots, C$ (4)

Equation (4) is a loss function that represents the loss incurred when class j is misjudged as class i. The values of λ constitute the C×C loss matrix. The minimum-risk Bayesian decision selects the decision α_i with the smallest conditional risk: $\arg\min_{i} R_{i} (x), \quad R_{i} (x) = \sum_{j = 1}^{C} \lambda (\alpha_{i}, w_{j}) P (w_{j} | x)$ (5)
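The two decision rules in Equations (1) and (5) can be illustrated with a minimal NumPy sketch. The posterior values and both loss matrices below are hypothetical numbers chosen for illustration only; they are not taken from the experiments in this paper.

```python
import numpy as np

def max_posterior_decision(posteriors):
    """Equation (1): choose the class with the largest posterior p(w_i | x)."""
    return int(np.argmax(posteriors))

def min_risk_decision(posteriors, loss):
    """Equation (5): choose the decision i minimizing
    R_i(x) = sum_j loss[i, j] * p(w_j | x),
    where loss[i, j] is the loss of deciding class i when the true class is j."""
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

# Hypothetical posteriors for a 3-class problem (classes indexed 0, 1, 2).
posteriors = np.array([0.5, 0.3, 0.2])

# Zero-one loss: every misjudgment costs 1.
zero_one = 1.0 - np.eye(3)

# Hypothetical asymmetric loss: missing true class 2 costs 10 instead of 1.
asymmetric = zero_one.copy()
asymmetric[:, 2] = [10.0, 10.0, 0.0]

map_choice = max_posterior_decision(posteriors)
zero_one_choice, zero_one_risks = min_risk_decision(posteriors, zero_one)
asym_choice, asym_risks = min_risk_decision(posteriors, asymmetric)

print(map_choice, zero_one_choice, asym_choice)
```

Under the zero-one loss, the minimum-risk decision coincides with the maximum-posterior decision. Under the asymmetric loss, the decision moves to class 2 even though its posterior is the smallest, because misjudging that class is costly.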

For multi-class problems, Bayesian methods are highly efficient and add little computational complexity. When the assumption of independent distributions holds, Bayesian methods perform exceptionally well and can outperform logistic regression.

2.2 Autoencoders

Autoencoders, which are unsupervised neural networks, learn the implicit features of input data. Like principal component analysis (PCA), autoencoders are used for feature dimensionality reduction, but they can perform better than PCA because they can extract more effective new features. Besides dimensionality reduction, autoencoders also act as feature extractors, and the new features they learn can be fed into supervised learning models. As unsupervised learners, autoencoders can also generate new data that differs from the training samples, so they are often regarded as generative models.

Let x = [x₁, x₂,..., x_n] be the input of an autoencoder, y = [y₁, y₂,..., y_m] the compressed representation in the hidden layer, and $x^{'} = [x_{1}^{'}, x_{2}^{'}, \ldots, x_{n}^{'}]$ the output, where n and m are the numbers of nodes in the input and hidden layers, respectively. Let W ∈ R^(m×n) and W′ ∈ R^(n×m) denote the encoding and decoding weight matrices, and let b ∈ R^m and b′ ∈ Rⁿ denote the bias terms of the hidden layer and the output layer, respectively.

Encoding, i.e., y = f (Wx + b), is a linear combination followed by a nonlinear activation function. Without the nonlinearity, an autoencoder is essentially no different from regular PCA. Using the obtained y, the input x can be reconstructed in the decoding stage, i.e., x′ = f (W′y + b′).

Autoencoders learn latent, compressed representations of the input by minimizing the error between the input and the reconstructed output [51]. Because simply reconstructing the original input is not meaningful in itself, we expect autoencoders to capture the most valuable information in the original input so as to learn more meaningful representations. Usually, some restrictions are imposed on autoencoders, e.g., W′ = W^T. Additional constraints can also be appended; denoising autoencoders and sparse autoencoders belong to this case.
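As a minimal sketch of the encoding and decoding stages, the following NumPy code computes y = f (Wx + b) and x′ = f (W′y + b′) with the tied-weight restriction W′ = W^T. The layer sizes, the random initialization and the mean squared reconstruction error are assumptions made for illustration, and no training step is shown.

```python
import numpy as np

def sigmoid(u):
    """Logistic sigmoid used as the nonlinear activation f."""
    return 1.0 / (1.0 + np.exp(-u))

def encode(x, W, b):
    """Encoding stage: y = f(W x + b)."""
    return sigmoid(W @ x + b)

def decode(y, W_prime, b_prime):
    """Decoding stage: x' = f(W' y + b')."""
    return sigmoid(W_prime @ y + b_prime)

rng = np.random.default_rng(0)
n, m = 8, 3                                 # illustrative input and hidden sizes
W = rng.normal(scale=0.1, size=(m, n))      # encoding weights, W in R^(m x n)
b = np.zeros(m)                             # hidden-layer bias
b_prime = np.zeros(n)                       # output-layer bias
W_prime = W.T                               # tied weights: W' = W^T

x = rng.random(n)                           # a synthetic input in [0, 1)
y = encode(x, W, b)                         # compressed representation
x_rec = decode(y, W_prime, b_prime)         # reconstruction of x

# Mean squared reconstruction error (one possible choice of error measure).
reconstruction_error = float(np.mean((x - x_rec) ** 2))
print(y.shape, x_rec.shape, reconstruction_error)
```

Training would adjust W, b and b′ to reduce the reconstruction error over a data set; tying the weights halves the number of weight parameters.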

3 Method

In Section 3.1, Bayesian combination methods are described in detail. In Section 3.2, the proposed model is given and its rationale is explained. Finally, the training and testing of the model are described in Section 3.3.

3.1 Bayesian combination methods

Given a class label, let us assume that the classifiers are mutually independent, i.e., conditionally independent. The term p(c_i) denotes the probability that classifier L_i assigns sample x to class c_i from the class list.

Conditional independence gives $p (c | w_{k}) = p (c_{1}, c_{2}, \ldots, c_{n} | w_{k}) = \prod_{i = 1}^{n} p (c_{i} | w_{k}),$ (6) where n is the number of classifiers and w_k is the k-th class in the class set. The posterior probability used for tagging x is given in Equation (7).

$\begin{aligned} p (w_{k} | c) &= \frac{p (w_{k}) p (c | w_{k})}{p (c)} \\ &= \frac{p (w_{k}) \prod_{i = 1}^{n} p (c_{i} | w_{k})}{p (c)}, \quad k = 1, 2, \ldots \end{aligned}$ (7)

Since the denominator in Equation (7) does not depend on w_k (a property of the Bayes formula), it can be ignored. As a result, the support for class w_k is calculated as follows. $\Xi_{k} (x) \propto p (w_{k}) \prod_{i = 1}^{n} p (c_{i} | w_{k})$ (8)

Next, we discuss how to calculate a C×C confusion matrix for classifier L_i, where C is the number of classes. In the confusion matrix M_i, the entry M_i(j,u) is the number of dataset elements that have the class label w_j and are assigned to class w_u by the classifier L_i. D_j denotes the number of dataset elements in class w_j.

According to [52], the probability estimate p(c_i|w_j) and the prior probability of class w_j are replaced by $\frac{M_{i} (j, u_{i})}{D_{j}}$ and $\frac{N_{j}}{N}$ , respectively, where u_i is the class assigned to x by classifier L_i. Equation (8) is then modified into Equation (9): $\Xi_{k} (x) \propto \prod_{i = 1}^{n} \frac{M_{i} (k, u_{i}) + \frac{1}{C}}{D_{k} + 1}$ (9)

Note that if zero were used as the estimate of p(c_i|w_k) in Equation (8), it would automatically nullify Ξ_k (x). The additive terms in Equation (9) prevent this. Hence, Ξ_k (x) in Equation (9) indicates the support for labeling sample x as class w_k.
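The combination rule in Equation (9) can be sketched in NumPy as follows. The two confusion matrices and the classifier votes are hypothetical, and D_k is taken here as the row sum of each confusion matrix, which assumes that every classifier was evaluated on the same labeled data set.

```python
import numpy as np

def bayesian_combination(confusions, votes):
    """Support Xi_k(x) of Equation (9) for every class k (up to a constant).

    confusions[i][j, u]: number of elements with true class j that
                         classifier i assigned to class u.
    votes[i]:            class u_i that classifier i assigns to the sample x.
    """
    confusions = np.asarray(confusions, dtype=float)
    n_classes = confusions.shape[1]
    support = np.ones(n_classes)
    for matrix, u in zip(confusions, votes):
        d = matrix.sum(axis=1)  # D_k: elements whose true class is k (assumption)
        support *= (matrix[:, u] + 1.0 / n_classes) / (d + 1.0)
    return support

# Hypothetical confusion matrices of two classifiers on a 2-class problem.
m1 = [[8, 2],
      [0, 10]]
m2 = [[9, 1],
      [3, 7]]

# Both classifiers assign the sample x to class 0.
support = bayesian_combination([m1, m2], votes=[0, 0])
predicted_class = int(np.argmax(support))
print(support, predicted_class)
```

In this example the first classifier never assigned a class-1 element to class 0, so a raw frequency estimate would be zero. The correction terms keep the support for class 1 small but strictly positive.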

3.2 Model architecture and parameters

The proposed AEBd has three hidden layers, as shown in Fig. 1. The first and second hidden layers are used for non-redundant feature selection. Bayesian methods form the third hidden layer, named the ‘Bayesian layer’, whose purpose is to increase precision when selecting non-redundant features. In the input layer, the number of neurons equals the dimensionality of the input data. The output layer presents the final selected features.

Fig. 1. The architecture of AEBd.

The activation of the i-th node (i = 1, 2, ...) in the j-th (j = 1, 2) hidden layer is given by Equation (10), where $W_{ij}^{(1, 2)}$ denotes the connection weight from the input layer to the j-th hidden layer and $b_{i}^{(1, 2)}$ denotes the bias of the i-th hidden node. The activation of the l-th node (l = 1, 2, ...) in the third hidden layer is given by Equation (11). The function f is the logistic sigmoid in Equation (12), which is widely used in machine learning and pattern recognition tasks [40]. In addition, the proposed model is trained by minimizing the negative log-likelihood loss, i.e., L = - log P (x|x′).

$H_{1, 2} = f \left( \sum_{j = 1}^{2} W_{ij}^{(1, 2)} x_{j} + b_{i}^{(1, 2)} \right)$ (10)

$H_{3} = f \left( \sum W_{l 3} \, \Xi_{k} (x) + b_{l}^{(3)} \right)$ (11)

$f (u) = \frac{1}{1 + \exp (- u)}$ (12)

Rationale. We adopt a hybrid architecture that combines an autoencoder with Bayesian methods in order to exploit the strengths of both: the excellent ability of the former to capture features and the ability of the latter to correct the classification of features. Deep architectures can obtain high-quality nonlinear features from high-dimensional data [53–55]. Once combined with the Bayesian method, an autoencoder can further tune the accuracy of the acquired features, because Bayesian methods consider the occurrence probability of various reference events and the loss caused by misjudgment. By integrating the two, the proposed AEBd can be expected to select higher-quality features from high-dimensional data.

3.3 Training and testing

Training. To reduce the risk of over-fitting, dropout, which works by probabilistically turning off some neurons, is applied to the first and second hidden layers of AEBd during training. Training is complete once the parameters converge.
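A minimal sketch of dropout on a hidden-layer activation vector is given below. It uses the common "inverted dropout" variant, in which the surviving activations are rescaled during training. This variant and the drop probability of 0.5 are assumptions made for illustration, since the exact dropout settings of AEBd are not specified here.

```python
import numpy as np

def dropout(h, drop_prob, rng, training=True):
    """Inverted dropout: during training, zero each activation with
    probability drop_prob and rescale the survivors by 1 / (1 - drop_prob);
    at test time, return the activations unchanged."""
    if not training or drop_prob == 0.0:
        return h
    keep_prob = 1.0 - drop_prob
    mask = rng.random(h.shape) < keep_prob   # True where the neuron is kept
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones(10000)                           # synthetic hidden activations

h_train = dropout(h, drop_prob=0.5, rng=rng)                  # training mode
h_test = dropout(h, drop_prob=0.5, rng=rng, training=False)   # testing mode

print(h_train.mean(), h_test.mean())
```

Because of the rescaling, the expected activation is the same in training and testing, so no extra scaling is needed when the trained model is applied to test data.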

Testing. Once AEBd is well trained, given a testing data set, AEBd presents the output results.

4 Experimental settings

In Section 4.1, we describe the dataset. In Section 4.2, the competitors and their parameters are presented. The assessment metric is given in Section 4.3.

4.1 Datasets description

The MNIST dataset is a popular dataset of handwritten digits from 0 through 9 that is widely used in machine learning. MNIST, which has a training set of 60,000 images and a test set of 10,000 images, has effectively become a benchmark. Each image in MNIST has been size-normalized and centered in a fixed-size image. Hence, MNIST is well suited to verifying our method.

4.2 Comparison methods and parameters

In order to assess our method objectively, a hybrid approach, the Bayesian AutoEncoder (BAE) in [45], is used for comparison. In addition, we choose two kinds of autoencoders as competitors: the Stacked AE in [40] and the sparse autoencoder (SAE) in [41]. Sparse autoencoders are a type of regularized autoencoder in which a penalty term is added to the loss function to make the representation layer sparse. We select these two autoencoders for comparison because we want to explore how the accuracy of feature selection is affected by changing the autoencoder architecture (e.g., stacking) or by adding constraints to autoencoders (e.g., sparsity terms).
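To make the idea of a sparsity penalty concrete, the sketch below adds a Kullback-Leibler penalty on the mean hidden activations to a reconstruction error. This is one common formulation of a sparse autoencoder loss and is not necessarily the one used in [41]; the target sparsity rho, the weight beta and the mean squared reconstruction error are illustrative assumptions.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j), where rho_hat_j is the
    mean activation of unit j over the batch (activations assumed in (0, 1))."""
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return float(kl.sum())

def sparse_loss(x, x_rec, hidden, beta=3.0, rho=0.05):
    """Reconstruction error plus the weighted sparsity penalty."""
    reconstruction = float(np.mean((x - x_rec) ** 2))
    return reconstruction + beta * kl_sparsity_penalty(hidden, rho)

# Synthetic hidden activations: 4 samples, 3 hidden units.
sparse_hidden = np.full((4, 3), 0.05)   # mean activation equals the target rho
dense_hidden = np.full((4, 3), 0.5)     # mean activation far above the target

x = np.zeros((4, 6))                    # synthetic inputs, perfectly reconstructed
penalty_sparse = kl_sparsity_penalty(sparse_hidden)
penalty_dense = kl_sparsity_penalty(dense_hidden)
loss_sparse = sparse_loss(x, x, sparse_hidden)
loss_dense = sparse_loss(x, x, dense_hidden)

print(penalty_sparse, penalty_dense)
```

The penalty vanishes when the mean activation of every hidden unit matches the target and grows as the units become more active, which pushes the representation layer toward sparsity.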

For the three competing methods, we apply the default parameters reported in the corresponding literature. Unless otherwise stated, all methods in this work run under the same experimental settings.

4.3 Assessment metric

The receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used to assess the accuracy of the methods.
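For reference, a minimal NumPy sketch of how an ROC curve and its AUC can be computed for a binary problem is given below. The labels and scores are synthetic. For a ten-class task such as digit recognition, one possible approach (an assumption, not stated in this paper) is to compute one such curve per digit in a one-vs-rest fashion.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold
    from the highest score to the lowest. labels must be 0 or 1."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    tps = np.cumsum(labels == 1)            # true positives at each cut
    fps = np.cumsum(labels == 0)            # false positives at each cut
    # Keep one point per distinct score so that ties share a threshold.
    last = np.r_[np.nonzero(np.diff(scores))[0], labels.size - 1]
    tpr = np.r_[0.0, tps[last] / tps[-1]]
    fpr = np.r_[0.0, fps[last] / fps[-1]]
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Synthetic example: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(labels, scores)
area = auc(fpr, tpr)
print(fpr, tpr, area)
```

An AUC of 1 corresponds to a perfect ranking of positives above negatives, while 0.5 corresponds to a random ranking.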

5 Results

The experimental results show that the proposed AEBd outperforms the three competitors (BAE, Stacked AE and SAE) in all considered cases in terms of the precision of feature selection. The following sections detail the experimental results.

5.1 Accuracy of feature selection

To compare the performance of the four methods, we assess their accuracy with the number of features ranging from 20 to 120. Figure 2 displays the results of feature selection on the MNIST digits dataset. It can be seen that the performance of all four methods decreases as more features are selected. Despite this, the proposed AEBd still performs better than the three competitors, as shown in Fig. 2 (a). The training accuracy in Fig. 2 (b) further shows that AEBd provides a greater improvement than the competitors. In particular, the advantage of AEBd becomes more obvious as the number of selected features increases.

Fig. 2. Results of feature selection.

Figure 3 displays the results of feature selection for every digit. The results show that AEBd outperforms the three competitors for every digit. They indicate a clear benefit of feature selection using AEBd: as the number of selected features increases, the accuracy of feature selection using AEBd is higher than that of SAE, Stacked AE and BAE. AEBd therefore demonstrates its utility for feature selection on digits. In addition, BAE shows better performance than both SAE and Stacked AE. This also demonstrates that autoencoders combined with probabilistic correction methods, e.g., AEBd and BAE, are more valuable than stacked architectures or adding constraints to autoencoders, e.g., Stacked AE and SAE. More importantly, all these trials used only a small fraction of the available data for training, which validates the procedure for limited-data situations.

Fig. 3. Effects on feature selection.

5.2 Comparison of dimensionality reduction

We give the ROC curves and the corresponding AUC values at different dimensionalities for the four methods in Fig. 4. To make the comparison intuitive, the results for a dimensionality of 10 are visualized in Fig. 5. Several observations can be drawn from Fig. 4 and Fig. 5.

Fig. 4. Accuracy of dimensionality reduction.

Fig. 5. Visualization results (dimensionality = 10).

(i) At every dimensionality, AEBd is more accurate than the three comparison methods. In addition, BAE is more accurate than SAE and Stacked AE. This demonstrates the advantages of autoencoders combined with probabilistic correction methods.

(ii) The more the dimensionality is reduced, the more the accuracy of each method drops. Nevertheless, the decline in precision during dimensionality reduction is slower for the proposed AEBd than for the comparison methods.

(iii) Autoencoders combined with probabilistic correction methods improve feature selection results more than stacked architectures or adding constraints to autoencoders. Of stacked autoencoders and sparse autoencoders, the former are more suitable for large-scale feature selection, while the latter are better for selecting a smaller number of features.

When using autoencoders for feature selection, we can modify the autoencoder architecture, as in stacked autoencoders, or add constraints, as in sparse autoencoders. With these approaches, however, it is difficult to avoid feature misjudgments during selection. In contrast, embedding Bayesian methods in the autoencoder greatly increases the precision of feature selection. This is also why our method outperforms the three competing methods.

6 Conclusion

In this work, we combined an autoencoder with Bayesian methods in order to select features that are useful for machine learning tasks while filtering out irrelevant and unimportant ones. The proposed method outperforms the existing feature selection approaches it was compared with. Our method also provides a theoretical way to analyze the optimality of feature selection. In the future, we will continue to explore this approach to feature selection.

Acknowledgments

This work was supported by the Teaching Reform Research Program of Chongqing Municipal Education Commission of China under Grant 203604 and the Association Scientific Research Program of Chongqing Municipal Education Commission of China under Grant CQGJ19B139.

References

1. Tang, Kay and He, Toward Optimal Feature Selection in Naive Bayes for Text Categorization [J], IEEE Transactions on Knowledge and Data Engineering 28(9) (2016), 2508–2521.

2. Zhang, Chan P.P.K., Biggio, et al., Adversarial Feature Selection against Evasion Attacks [J], IEEE Transactions on Cybernetics 6(3) (2014), 766–777.

3. Xue, Zhang M.B., et al., A Survey on Evolutionary Computation Approaches to Feature Selection [J], IEEE Transactions on Evolutionary Computation 20(4) (2015), 606–626.

4. Nag and Pal N.R., A Multiobjective Genetic Programming-Based Ensemble for Simultaneous Feature Selection and Classification [J], IEEE Transactions on Cybernetics 46(2) (2015), 499–510.

5. , Cheng, Wang, Morstatter, et al., Feature Selection: A Data Perspective [J], ACM Computing Surveys 50(6) (2016), 94.

6. Zhang, Mei C.L., Chen D.G., et al., Feature Selection in Mixed Data: A Method using a Novel Fuzzy Rough Set-Based Information Entropy [J], Pattern Recognition 56 (2016), 1–15.

7. Han, Yang, Yan, et al., Semisupervised Feature Selection via Spline Regression for Video Semantic Recognition [J], IEEE Transactions on Neural Networks and Learning Systems 26(2) (2015), 252–264.

8. , Si, Zhou, et al., FREL: A Stable Feature Selection Algorithm [J], IEEE Transactions on Neural Networks and Learning Systems 26(7) (2015), 1388–1402.

9. Tao, Hou, Nie, et al., Effective Discriminative Feature Selection With Nontrivial Solution [J], IEEE Transactions on Neural Networks and Learning Systems 27(4) (2016), 796–808.

10. Luo, Nie, Chang, et al., Adaptive Unsupervised Feature Selection With Structure Regularization [J], IEEE Transactions on Neural Networks and Learning Systems 29(4) (2018), 944–956.

11. Armanfard, Nie, James Reilly and Komeili, Local Feature Selection for Data Classification [J], IEEE Transactions on Pattern Analysis and Machine Intelligence 38(6) (2016), 1217–1227.

12. Gui, Sun, Ji, et al., Feature Selection Based on Structured Sparsity: A Comprehensive Study [J], IEEE Transactions on Neural Networks and Learning Systems 28(7) (2017), 1490–1507.

13. Wang, Wang and Chang, Feature selection methods for big data bioinformatics: A survey from the search perspective [J], Methods 111 (2016), 21–31.

14. Chakraborty and Pal N.R., Feature Selection Using a Neural Framework With Controlled Redundancy [J], IEEE Transactions on Neural Networks and Learning Systems 26(1) (2015), 35–50.

15. Chin A.J., Mirzal, Haron, et al., Supervised, Unsupervised and Semi-supervised Feature Selection: A Review on Gene Selection [J], IEEE/ACM Transactions on Computational Biology and Bioinformatics 13(5) (2016), 971–989.

16. Lazar, Taminau, Meganck, et al., A survey on filter techniques for feature selection in gene expression microarray analysis [J], IEEE/ACM Transactions on Computational Biology and Bioinformatics 9(4) (2012), 1106–1119.

17. Kabir M.M., Islam M.M. and Murase, A new wrapper feature selection approach using neural network [J], Neurocomputing 73(16-18) (2010), 3273–3283.

18. Wang, Tang and Liu, Embedded unsupervised feature selection [C], In Proc. Twenty-Ninth AAAI Conf Artif Intell, IEEE, 2015, 470–476.

19. Liu, Ye and Fujimaki, Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint [C], In Proc. 31st Int. Conf. Mach. Learn., IEEE, 2014, 503–511.

20. Nag and Pal N.R., A Multiobjective Genetic Programming-Based Ensemble for Simultaneous Feature Selection and Classification [J], IEEE Transactions on Cybernetics 46(2) (2016), 499–510.

21. Diao, Chao, Peng, et al., Feature Selection Inspired Classifier Ensemble Reduction [J], IEEE Transactions on Cybernetics 44(8) (2017), 1259–1268.

22. Hsu W.H., Genetic Wrappers for Feature Selection in Decision Tree Induction and Variable Ordering in Bayesian Network Structure Learning [J], Information Sciences 163 (2004), 103–122.

23. Guyon, Weston, Barnhill, et al., Gene Selection for Cancer Classification using Support Vector Machines [J], Machine Learning 46 (2002), 389–422.

24. Brown, Pocock, Zhao M.-J., et al., Conditional likelihood maximisation: A unifying framework for information theoretic feature selection [J], The Journal of Machine Learning Research 13 (2012), 27–66.

25. Khushaba R.N., Al-Ani and Al-Jumaily, Feature subset selection using differential evolution and a statistical repair mechanism [J], Expert Systems with Applications 38(9) (2011), 11515–11526.

26. Chakraborty and Pal N.R., Feature Selection Using a Neural Framework With Controlled Redundancy [J], IEEE Transactions on Neural Networks and Learning Systems 26(1) (2015), 35–50.

27. Sun, Huang S.H., Wong D.S., et al., Design and Application of a Variable Selection Method for Multilayer Perceptron Neural Network With LASSO [J], IEEE Transactions on Neural Networks and Learning Systems 28(6) (2017), 1386–1396.

28. Wang, Cai, Chang, et al., Convergence analyses on sparse feedforward neural networks via group lasso regularization [J], Information Sciences 381 (2017), 250–269.

29. Wang, Xu, Yang, et al., A Novel Pruning Algorithm for Smoothing Feedforward Neural Networks Based on Group Lasso Method [J], IEEE Transactions on Neural Networks and Learning Systems 29(5) (2018), 2012–2024.

30. Scardapane, Comminiello, Hussain, et al., Group sparse regularization for deep neural networks [J], Neurocomputing 241 (2017), 81–89.

31. Bengio, Yao, Alain, et al., Generalized denoising auto-encoders as generative models [C], In Advances in Neural Information Processing Systems, IEEE, 2013, 899–907.

32. Bengio, Courville and Vincent, Representation learning: A review and new perspectives [J], IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013), 1798–1828.

33. Sarah, Erfani, Rajasegarar, Karunasekera, et al., High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning [J], Pattern Recognition 58 (2016), 121–134.

34. Zhao, Hu and Wang, Heterogeneous Feature Selection with Multi-Modal Deep Neural Networks and Sparse Group Lasso [J], IEEE Transactions on Multimedia 17 (2015), 1936–1948.

35. Theis, Shi and Cunningham, Lossy image compression with compressive autoencoders [C], In International Conference on Learning Representations, IEEE, 2017.

36. Doersch, Tutorial on variational autoencoders [J], arXiv preprint 1606.05908 (2016).

37. Zeiler M.D., Ranzato M.A., Monga, et al., On rectified linear units for speech processing [C], In Acoustics, Speech and Signal Processing (ICASSP) 2013 IEEE International Conference on, IEEE, 2013, 3517–3521.

38. Yan and Yang, Sparse discriminative feature selection [J], Pattern Recognition 48(5) (2015), 1827–1835.

39. Cong, Wang, Liu, et al., Deep sparse feature selection for computer aided endoscopy diagnosis [J], Pattern Recognition 48(3) (2015), 907–917.

40. Shin H.C., Orton M.R., Collins D.J., et al., Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data [J], IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013), 1930–1943.

41. Deng, Zhang, Marchi and Schuller, Sparse autoencoder-based feature transfer learning for speech emotion recognition [C], In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, 2013, 511–516.

42. Tang, He, Paul Baggenstoss, et al., A Bayesian Classification Approach Using Class-Specific Features for Text Categorization [J], IEEE Transactions on Knowledge and Data Engineering 28(6) (2016), 1602–1606.

43. Akkasi and Varoğlu, Improving Biochemical Named Entity Recognition Using PSO Classifier Selection and Bayesian Combination Methods [J], IEEE/ACM Transactions on Computational Biology and Bioinformatics 14(6) (2017), 1327–1338.

44. Zhao, Kang and Long, Bayesian Multiresolution Variable Selection for Ultra-High Dimensional Neuroimaging Data [J], IEEE/ACM Transactions on Computational Biology and Bioinformatics 15(2) (2018), 537–550.

45. Nishino and Inaba, Bayesian AutoEncoder: Generation of Bayesian Networks with Hidden Nodes for Features [C], Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), IEEE, 2016, 4244–4245.

46. Kim Y.-S., Walls, Krafft, et al., A Bayesian cognition approach to improve data visualization [C], In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, ACM, 2019.

47. Griffiths T.L. and Tenenbaum J.B., Optimal predictions in everyday cognition [J], Psychological Science 17(9) (2006), 767–773.

48. Sobel D.M., Tenenbaum J.B. and Gopnik, Children’s causal inferences from indirect evidence: Backwards blocking and bayesian reasoning in preschoolers [J], Cognitive Science 28(3) (2004), 303–333.

49. Steyvers, Tenenbaum J.B., Wagenmakers E.-J., et al., Inferring causal networks from observations and interventions [J], Cognitive Science 27(3) (2003), 453–489.

50. Tenenbaum J.B., Griffiths T.L. and Kemp, Theory-based bayesian models of inductive learning and reasoning [J], Trends in Cognitive Sciences 10(7) (2006), 309–318.

51. Suk H.-I., Lee S.-W., Shen, et al., Latent feature representation with stacked auto-encoder for AD/MCI diagnosis [J], Brain Structure and Function (2013), 1–19.

52. Titterington, Comparison of discriminant techniques applied to a complex data set of head injured patients [J], J. Royal Statistical Society 144(2) (1981), 145–175.

53. LeCun, Bengio and Hinton, Deep learning [J], Nature 521 (2015), 436–444.

54. Lusch, Nathan Kutz and Steven Brunton, Deep learning for universal linear embeddings of nonlinear dynamics [J], Nature Communications 9 (2018), 1–10.

55. Segler M.H.S., Preuss and Waller M.P., Planning chemical syntheses with deep neural networks and symbolic AI [J], Nature 555 (2018), 604–610.