Abstract
A Deep Belief Network (DBN) is a generative probabilistic graphical model with many layers of hidden variables, and it has excelled among deep learning approaches. A DBN can extract suitable features, but improving these networks so that they yield more discriminative features remains an important issue. One important improvement is sparsity in the hidden units. Sparse representations have the property that the learned features are interpretable, i.e., they correspond to meaningful aspects of the input, and are more efficient. A main problem with sparsity techniques is finding the best hyper-parameter values, which requires dozens of experiments. In this paper, a dynamic hyper-parameter setting is proposed to resolve this problem; the proposed method does not require the parameters to be set manually. According to the results, the dynamic method achieves acceptable recognition accuracy on test sets in different applications, including image, speech and text, and it finds the hyper-parameters dynamically without losing much accuracy.
Introduction
Deep networks have received considerable attention from researchers as one of the most powerful tools in machine learning across different applications [3]. Although neural networks have been in common use for many years, the increase in computing power and new machine learning techniques have made it possible to use these networks with more layers.
Among the different deep learning techniques [3], the Deep Belief Network (DBN) [7] is one of the best known and most efficient. The DBN uses a greedy layer-wise unsupervised learning algorithm [2], where each layer is a Restricted Boltzmann Machine (RBM). Although each layer in a DBN can extract useful features, recent studies apply sparsity methods to the hidden layers in order to extract more beneficial features that generalize better [1].
Sparsity methods are used to decrease the activation values of hidden units [6]. Although different sparsity methods have been proposed for RBMs, these methods are highly dependent on the specified hyper-parameters [11], and finding the best hyper-parameters is very time consuming. In this paper, we propose a new method for the dynamic adjustment of hyper-parameters, which resolves the issue of finding the best values. The proposed method uses the normal sparse RBM method and finds its hyper-parameters dynamically, whereas the other sparse methods need dozens of experiments to find the best hyper-parameters.
The rest of this paper is organized as follows. In Section 2, the RBM and the DBN are described. Related work on sparse RBM methods is reviewed in Section 3, and our proposed method is presented in Section 4. In Section 5, experiments are conducted and the proposed method is compared with other methods on digit recognition on the MNIST dataset, spoken letter recognition on the ISOLET dataset and document topic classification on the 20 Newsgroups dataset. Finally, Section 6 concludes the paper.
Background
DBNs are composed of multiple layers of RBMs. The outputs of the hidden layer of each RBM are used as the input for the next RBM layer. As shown in Fig. 1, a DBN is trained layer by layer in this way [7].
Stack of RBMs in which the samples from the lower-level RBM are used as the data for training the next RBM [18].
An RBM is a Markov random field with two groups of units, hidden and visible (see Fig. 2). In an RBM, each neuron is connected to all the neurons in the other layer, but there are no connections between neurons in the same layer [5]. Because of this restriction, the hidden units are conditionally independent of one another given the visible units, and the visible units are conditionally independent of one another given the hidden units.
An example of Restricted Boltzmann Machines [19].
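The conditional independence described above means that each conditional distribution factorizes over units, so all hidden (or visible) activation probabilities can be computed in one matrix operation. The following is a minimal sketch for a binary RBM, assuming the usual logistic form of the conditionals; the names `hidden_probs` and `visible_probs`, the layer sizes and the random initialization are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, W, b):
    """p(h_j = 1 | v) for every hidden unit; v has shape (batch, n_visible)."""
    return sigmoid(v @ W + b)

def visible_probs(h, W, a):
    """p(v_i = 1 | h) for every visible unit; h has shape (batch, n_hidden)."""
    return sigmoid(h @ W.T + a)

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4                     # illustrative sizes (assumption)
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
a = np.zeros(n_visible)                        # visible biases
b = np.zeros(n_hidden)                         # hidden biases

v = rng.integers(0, 2, size=(3, n_visible)).astype(float)
p_h = hidden_probs(v, W, b)                    # one probability per hidden unit
h = (rng.random(p_h.shape) < p_h).astype(float)  # independent Bernoulli samples
p_v = visible_probs(h, W, a)
```

Because no unit depends on another unit in its own layer, sampling a whole layer requires only independent Bernoulli draws, which is what makes block Gibbs sampling in an RBM cheap.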
The joint probability distribution under the model uses an energy function of
where
According to the energy function, the joint probability distribution of the RBM model with visible and hidden units is defined as follows:
where
where
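For orientation, the standard formulation of a binary RBM can be written as follows. The notation is an assumption based on the parameter names (W, a, b) used in this section: $v_i$ and $h_j$ are the states of visible unit $i$ and hidden unit $j$, $a_i$ and $b_j$ are their biases, $w_{ij}$ is the weight between them, $Z$ is the partition function and $\sigma$ is the logistic function. The authors' own symbols and equation numbering may differ.

```latex
\begin{align*}
E(\mathbf{v},\mathbf{h}) &= -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i}\sum_{j} v_i w_{ij} h_j \\
p(\mathbf{v},\mathbf{h}) &= \frac{1}{Z}\, e^{-E(\mathbf{v},\mathbf{h})},
\qquad Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} \\
p(h_j = 1 \mid \mathbf{v}) &= \sigma\Big(b_j + \sum_{i} v_i w_{ij}\Big),
\qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_{j} w_{ij} h_j\Big)
\end{align*}
```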
Training an RBM maximizes the likelihood of the training data with respect to the model parameters:
where the parameter
The expectation
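As an illustration of this kind of training, the following minimal sketch applies one-step contrastive divergence (CD-1) updates to a toy binary RBM. It is not the FEPCD procedure used in the experiments of this paper; the function names `cd1_update` and `recon_error`, the toy data, the learning rate and the number of iterations are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step on a batch v0 of shape (batch, n_visible); returns new (W, a, b)."""
    # Positive phase: hidden probabilities and a binary sample driven by the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: a single Gibbs step produces the reconstruction.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    n = v0.shape[0]
    # Data statistics minus reconstruction statistics approximate the gradient.
    dW = (v0.T @ ph0 - pv1.T @ ph1) / n
    da = (v0 - pv1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db

def recon_error(v, W, a, b):
    """Mean squared error of the deterministic (mean-field) reconstruction."""
    ph = sigmoid(v @ W + b)
    return float(np.mean((v - sigmoid(ph @ W.T + a)) ** 2))

rng = np.random.default_rng(0)
# Toy data (hypothetical): two binary patterns, each repeated ten times.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
W = 0.01 * rng.standard_normal((6, 4))
a = np.zeros(6)
b = np.zeros(4)

err_before = recon_error(data, W, a, b)
for _ in range(500):
    W, a, b = cd1_update(data, W, a, b, lr=0.1, rng=rng)
err_after = recon_error(data, W, a, b)
print(err_before, err_after)
```

The expectation under the model is replaced here by statistics from a single reconstruction step, which is the approximation that makes CD practical; PCD and FEPCD differ in how the negative-phase samples are obtained.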
Different methods have been proposed to build sparse RBMs [8, 11, 13, 17]. Essentially, RBMs learn non-sparse distributed representations. In all the proposed methods, the RBM learning algorithm is changed to force the RBM to learn a sparse representation. For this purpose, a regularization term (
where
1. Update the parameters (W, a, b) using an approximation to the gradient of the log likelihood, such as CD, PCD or FEPCD.
2. Update the parameters (W, a, b) using the gradient of the regularization term.
3. Repeat steps 1 and 2 until convergence or until the maximum number of epochs is reached.
Different researchers have used different regularization terms. One of them quadratically penalizes the deviation of the expected activation of the hidden units from a (low) fixed level
where
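To make the quadratic penalty concrete, the sketch below assumes that it has the form of a sum over hidden units of (target − mean activation)², where the mean is taken over a batch, and computes its exact gradient with respect to the hidden biases only. The names `target` and `lam`, the toy batch and the restriction to the biases are assumptions for illustration; practical implementations often use further approximations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def quadratic_penalty(V, W, b, target):
    """Sum over hidden units of (target - mean activation)^2 on the batch V."""
    q = sigmoid(V @ W + b).mean(axis=0)
    return float(np.sum((target - q) ** 2))

def quadratic_penalty_grad_b(V, W, b, target):
    """Exact gradient of the penalty with respect to the hidden biases b."""
    p = sigmoid(V @ W + b)
    q = p.mean(axis=0)
    # d q_j / d b_j is the batch mean of the sigmoid derivative p * (1 - p).
    return -2.0 * (target - q) * (p * (1.0 - p)).mean(axis=0)

rng = np.random.default_rng(0)
V = rng.integers(0, 2, size=(20, 6)).astype(float)  # toy batch (hypothetical)
W = 0.5 * rng.standard_normal((6, 4))
b = np.zeros(4)
target = 0.05   # low fixed activation level (illustrative value)
lam = 1.0       # step size for the penalty term (illustrative value)

penalty_before = quadratic_penalty(V, W, b, target)
for _ in range(200):
    # Gradient descent on the penalty alone pushes mean activations toward the target.
    b = b - lam * quadratic_penalty_grad_b(V, W, b, target)
penalty_after = quadratic_penalty(V, W, b, target)
print(penalty_before, penalty_after)
```

In a full sparse RBM this penalty step would alternate with the likelihood update, as in the two-step procedure listed above.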
Another state-of-the-art method, based on rate distortion theory, uses the activation probability of the hidden units as the penalty factor [8]. We call the method using this penalty factor the rate distortion sparse RBM, or rdsRBM. Its regularization term is as follows:
where
Our proposed method uses the normal probability density function as the regularization term [11]. The normal regularization term behaves differently depending on the deviation of the activation of the hidden units from a (low) fixed level
where
The normal regularization term has a parameter, the variance, that controls how strongly sparseness is enforced. As Fig. 3 shows, different variance values change the penalty value and therefore the degree of sparseness.
Normal pdf with 
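The role of the variance can be illustrated with a short sketch. It assumes that the regularizer rewards each hidden unit with the value of a normal pdf evaluated at its mean activation q and centred on the sparsity target, so that ascending the pdf pulls q toward the target. This form, the names `normal_pdf` and `normal_pdf_grad`, and all numeric values are assumptions for illustration rather than the exact term of [11].

```python
import numpy as np

def normal_pdf(q, target, sigma):
    """Normal density with mean `target` and standard deviation `sigma`, evaluated at q."""
    return np.exp(-0.5 * ((q - target) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def normal_pdf_grad(q, target, sigma):
    """Derivative of the normal density with respect to q."""
    return -(q - target) / sigma**2 * normal_pdf(q, target, sigma)

q = np.linspace(0.0, 1.0, 11)   # candidate mean activations of a hidden unit
target = 0.1                    # illustrative sparsity target
for sigma in (0.05, 0.2, 0.5):  # illustrative standard deviations
    g = normal_pdf_grad(q, target, sigma)
    strongest = q[np.argmax(np.abs(g))]
    print(f"sigma={sigma}: strongest pull at q={strongest:.1f}")
```

The gradient magnitude of a normal pdf peaks one standard deviation away from its mean and decays beyond that, so a small variance acts only on units whose activation is already near the target, while a larger variance also reaches units that are far from it.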
Therefore, the RBM parameters are updated with the regularization term based on the following equations by computing the gradient of our regularization term [11]:
where
where
Similarly, for hidden bias we have the following equation:
The main problem in all of the described sparsity methods is finding appropriate values for the hyper-parameters of the sparsity regularization term. Experiments show that all these methods are very sensitive to their hyper-parameters, and finding appropriate values is very time consuming [11]. For example, the normal sparse RBM has three hyper-parameters (sparsity cost, sparsity target and variance), and finding appropriate values for them requires about 100 experimental runs.
In this paper, based on the features of nsRBM, we propose a new dynamic method for adjusting the hyper-parameters. This method does not need separate experiments to find appropriate hyper-parameter values. We call this method the dynamic normal sparse RBM, or dynNsRBM.
The first hyper-parameter in nsRBM is sparsity target (
where
The other hyper-parameter in the nsRBM method is the variance of normal function (
where
The last hyper-parameter in the nsRBM method is sparsity cost (
The pseudocode below describes the dynamic sparse RBM method. For simplicity, the computation of some terms, such as momentum and weight decay, has been excluded.
According to the pseudocode, in the first stage some parameters of RBM learning will be initialized. In this initialization,
Then in the main loop for training dynamic sparse RBM, positive and negative samples are obtained. These samples are used for computing the gradient of log likelihood term in Eq. (7). Afterwards, the gradient of
Program parameters:
Setting learning rate and sparsity cost (
Initialization:
Initialize Initialize the 30 Markov Chains
Then repeat:
Get the next batch of training data, Calculate Calculate Update the parameters ( Calculate the dynamic sparse parameters (
Update the parameters (
For a better understanding of the dynamic sparsity method, a histogram of the activation probability of the hidden units and the corresponding regularization term on the ISOLET dataset are shown in Fig. 4. Although the average activation probability of the hidden units is about 0.5, the normal function parameters are set dynamically so that the average activation probability tends towards zero (by setting the sparsity target below the average activation probability of the hidden units). In addition, the selected variance allows the regularization term to have an appropriate effect on all hidden units.
Histogram of the activation probability of hidden units and its regularization term. Although the average activation probability of the hidden units is about 0.5, the normal function parameters are set dynamically so that the average activation probability tends towards zero (by setting the sparsity target below the average activation probability of the hidden units).
The reduction in the average activation probability of hidden units and the dynamic change of regularization terms during learning are shown in Fig. 5. According to this figure, the normal function has been moved toward zero due to a decrease in the activation probability of hidden units during learning. It should be noted that in a classic RBM, the activation probability of hidden units does not change.
Reduction in average activation probability of hidden units and dynamic change of regularization term during learning. In this figure, the normal function has been moved to zero due to decreasing activation probability of hidden units during learning.
The proposed method was evaluated on datasets from different applications and with different input types: digit recognition on the MNIST dataset, with discrete feature values between 0 and 255; spoken letter recognition on the ISOLET dataset, with real feature values; and document topic classification on the 20 Newsgroups dataset, with binary feature values. In these experiments, we used the DeeBNet toolbox [10], which was implemented by the authors. In addition, we used the FEPCD method [17] to approximate the gradient of the log likelihood.
MNIST dataset
The MNIST dataset is available online at “http://yann.lecun.com/exdb/mnist/”.
To compare the methods and to assess the proposed method in a feature learning application, several experiments were conducted. In these experiments, different sparsity methods (qsRBM, rdsRBM, nsRBM and dynNsRBM) are compared with the classic RBM, principal component analysis (PCA) and raw features. All models were trained on 20,000 training samples of the MNIST dataset (similar to [8]) to extract 500 features from the visible data (except for raw features, where all input features are used).
With these learnt models, the features of 10, 20, 50, 100, 500 and 1000 images per class are used to train a linear classifier. A simple linear classifier is used so that differences in error rate reflect the strength of the learnt features rather than the power of the classifier. For a more reliable evaluation, the image selection in each class and the linear classifier training were repeated twenty times independently, and the average results are reported.
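The evaluation protocol described above (draw a fixed number of samples per class, train a linear classifier on their features, repeat twenty times and average the test error) can be sketched as follows. The paper does not state which linear classifier was used, so a ridge-regularized least-squares classifier on one-hot targets serves as a stand-in here; the synthetic features and all function names are hypothetical.

```python
import numpy as np

def sample_per_class(labels, n_per_class, rng):
    """Indices of n_per_class randomly chosen samples from every class."""
    idx = [rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
           for c in np.unique(labels)]
    return np.concatenate(idx)

def fit_linear(X, y, n_classes, ridge=1e-3):
    """Least-squares linear classifier on one-hot targets (a stand-in choice)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    T = np.eye(n_classes)[y]
    return np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ T)

def error_rate(coef, X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ coef, axis=1) != y))

def mean_error(X_train, y_train, X_test, y_test, n_per_class, repeats=20, seed=0):
    """Average test error over `repeats` random per-class training subsets."""
    rng = np.random.default_rng(seed)
    n_classes = int(y_train.max()) + 1
    errors = []
    for _ in range(repeats):
        idx = sample_per_class(y_train, n_per_class, rng)
        coef = fit_linear(X_train[idx], y_train[idx], n_classes)
        errors.append(error_rate(coef, X_test, y_test))
    return float(np.mean(errors))

# Synthetic stand-in features (hypothetical): three well-separated Gaussian classes.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y_train = np.repeat(np.arange(3), 100)
y_test = np.repeat(np.arange(3), 50)
X_train = centers[y_train] + rng.standard_normal((300, 2))
X_test = centers[y_test] + rng.standard_normal((150, 2))

for n in (10, 20, 50):
    print(n, mean_error(X_train, y_train, X_test, y_test, n))
```

In the actual experiments, `X_train` and `X_test` would hold the features produced by each model (raw data, PCA codes or hidden-unit activation probabilities), so that the same classifier is applied to every representation.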
Digit recognition error rates on the MNIST dataset on training (with 10, 20, 50, 100, 500 and 1000 samples per class) and test sets obtained by a linear classifier that was trained on raw data, transformed codes produced by PCA and active probability of hidden units produced by RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM.
In Fig. 6, the recognition performance based on raw data (784 features), transformed codes produced by PCA (500 features), and RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM (all with 500 hidden units) is compared. In this figure, the results of qsRBM, rdsRBM and nsRBM are reported for the best parameter combinations, which were obtained through dozens of experiments [11]. For brevity, only the best model results are reported. In dynNsRBM, the parameters were obtained dynamically.
Several results can be observed in Fig. 6. First, with all types of RBMs used for feature learning, the error rate decreases considerably, and this improvement is more evident for small numbers of training samples. Second, all types of sparse RBMs extract more discriminative features and have lower error rates than the classic RBM. Third, even though the hyper-parameter values in the proposed method are found dynamically, its performance is close to the best nsRBM results, and in some cases its error is lower. It is noteworthy that the nsRBM error rates were achieved through dozens of experiments to find the best hyper-parameter values, whereas in the proposed method the hyper-parameter values are obtained dynamically at runtime.
Digit recognition error rates on MNIST training (with 10, 20, 50, 100, 500 and 1000 samples per class) and test sets obtained by a linear classifier that was trained on active probability of hidden units produced by the first and second layers in DBN, qsDBN, rdsDBN, nsDBN and dynNsDBN.
Since a DBN can learn higher levels of representation in its layers [1], in the next experiments a layer with 100 hidden neurons was added to the previous models (RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM). After training the new models with their respective learning methods, a feature vector of the learnt features in the first and second layers was used to train the linear classifier (500 and 100 features in the first and second layers, respectively). Figure 7 depicts the error rate on the test set. According to Fig. 7, all types of sparse DBNs extract more discriminative features and have lower error rates than the classic DBN. In addition, the performance of the proposed method is close to the best nsDBN results, and in some cases its error is lower.
Spoken letter recognition error rates on ISOLET dataset on training (with 10, 20, 50 and 100 samples per class) and test sets obtained by a linear classifier that was trained on raw data, transformed codes produced by PCA and active probability of hidden units produced by RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM.
The ISOLET dataset is available online at “https://archive.ics.uci.edu/ml/datasets/ISOLET”.
Similar to the MNIST test, different sparsity methods (qsRBM, rdsRBM, nsRBM and dynNsRBM) are compared with the classic RBM, principal component analysis and raw features. All models were trained on 3,640 feature vectors (i.e. 140 feature vectors per class) of the ISOLET dataset to extract 500 features from the visible data (except for raw features, where all input features were used).
Now with these learnt models, features of 10, 20, 50 and 100 samples per class are used in training a linear classifier. For each combination of training data size and algorithm, we trained 20 classifiers with randomly chosen training sets for the linear classifier and then used the average classification error to evaluate the performance of the corresponding algorithm.
In Fig. 8, the recognition performance based on raw data (617 features), transformed codes produced by PCA (500 features), and RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM (all with 500 hidden units) is compared. In this figure, the results of qsRBM, rdsRBM and nsRBM are reported for the best parameter combinations, which were obtained through dozens of experiments. In dynNsRBM, the parameters were obtained dynamically. According to these results, even though the hyper-parameter values in the proposed method are found dynamically, its performance is close to that of the other sparse techniques.
Similar to experiments done on MNIST, in order to compare the recognition ability of representations learnt by DBN, qsDBN, rdsDBN, nsDBN and dynNsDBN, we trained a layer with 100 hidden neurons. After learning new models with their learning methods, a feature vector of learnt features in the first and second layers was used for training the linear classifier (500 and 100 features in the first and second layer, respectively). Figure 9 depicts the error rates on the test set. According to Fig. 9, it can be seen that even though finding the hyper parameter values in the proposed method is dynamic, the performance is close to the other sparse techniques.
Spoken letter recognition error rates on ISOLET training (with 10, 20, 50 and 100 samples per class) and test sets, obtained by a linear classifier that was trained on active probability of hidden units produced by the first and second layers in DBN, qsDBN, rdsDBN, nsDBN and dynNsDBN.
The 20 Newsgroups3
Available online at “http://qwone.com/
For a text classification experiment, we used a version of the 20 Newsgroups4
Available online at “http://qwone.com/
Document topic classification error rates on 20 Newsgroups dataset on training (with 10, 20, 50, 100, 200 and 350 samples per class) and test sets obtained by a linear classifier that was trained on raw data, transformed codes produced by PCA and active probability of hidden units produced by RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM.
Similar to the MNIST test, different sparsity methods (qsRBM, rdsRBM, nsRBM and dynNsRBM) are compared with the classic RBM, principal component analysis and raw features. All models were trained on 7,500 feature vectors (i.e. 375 feature vectors per class) of the 20 Newsgroups dataset to extract 500 features from the visible data (except for raw features, where all input features are used).
Document topic classification error rates on 20 Newsgroups dataset on training (with 10, 20, 50, 100, 200 and 350 samples per class) and test sets, obtained by a linear classifier that was trained on active probability of hidden units produced by the first and second layer in DBN, qsDBN, rdsDBN, nsDBN and dynNsDBN.
With these learnt models, the features of 10, 20, 50, 100, 200 and 350 feature vectors per class are used to train a linear classifier. For each combination of training data size and algorithm, we trained 20 classifiers on randomly chosen training sets and then used the average classification error to evaluate the performance of the corresponding algorithm.
In Fig. 10, the recognition performance based on raw data (5000 features), transformed codes produced by PCA (500 features), and RBM, qsRBM, rdsRBM, nsRBM and dynNsRBM (all with 500 hidden units) is compared. In this figure, the results of qsRBM, rdsRBM and nsRBM are reported for the best parameter combinations, which were obtained through dozens of experiments, whereas in dynNsRBM the parameters were obtained dynamically. According to the results, for almost all types of RBMs used in feature learning, the error rate decreases considerably, and this improvement is more evident for small numbers of training samples. Even though the hyper-parameter values in the proposed method are found dynamically, its performance is close to that of the other sparse techniques.
Similar to the experiments on MNIST, in order to compare the recognition ability of the representations learnt by DBN, qsDBN, rdsDBN, nsDBN and dynNsDBN, we trained a second layer with 100 hidden neurons. After training the new models with their respective learning methods, a feature vector of the learnt features in the first and second layers was used to train the linear classifier (500 and 100 features in the first and second layers, respectively). Figure 11 displays the error rate on the test set. According to Fig. 11, even though the hyper-parameter values in the proposed method are found dynamically, its performance is close to that of the other sparse techniques.
In all of the RBM experiments, we applied a two-sample t-test and a Wilcoxon test to the results of our new method and of nsRBM to show that the average errors of the two methods are similar.
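A comparison of this kind can be sketched with the Python standard library alone. The sketch below computes a Welch two-sample t statistic and a Wilcoxon rank-sum test with a normal approximation; reading "Wilcoxon test" as the rank-sum variant is an assumption, the variance of the rank sum is not corrected for ties, and the error values are placeholders rather than results from the paper.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(x, y):
    """Two-sample t statistic without assuming equal variances (Welch)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def rank_sum_test(x, y):
    """Wilcoxon rank-sum z score and two-sided p-value (normal approximation)."""
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at i and give it the average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0
        i = j + 1
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Placeholder error rates (hypothetical values, not taken from the paper).
errors_method_a = [0.121, 0.118, 0.125, 0.119, 0.123, 0.120]
errors_method_b = [0.120, 0.122, 0.117, 0.124, 0.119, 0.121]

t_stat = welch_t(errors_method_a, errors_method_b)
z_score, p_value = rank_sum_test(errors_method_a, errors_method_b)
print(t_stat, z_score, p_value)
```

A large p-value means only that the test fails to detect a difference between the two sets of error rates; it supports, but does not prove, the claim that the methods perform similarly.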
In this paper, we presented a new sparse RBM method. Different methods have been proposed to build sparse RBMs. The main problem in all of these sparsity methods is finding appropriate values for the hyper-parameters of the sparsity regularization term. Experiments show that these methods are very sensitive to their hyper-parameters, and finding appropriate values is very time consuming.
Based on the features of nsRBM, we proposed a new dynamic method for adjusting the hyper-parameters that does not require separate experiments to find appropriate values.
According to the results, the proposed method finds the hyper-parameter values dynamically, its performance is close to the best results of the other sparse methods, and in some cases its error is lower. It is noteworthy that the error rates of the other sparse methods were obtained through dozens of experiments to find the best hyper-parameter values, whereas in the proposed method the hyper-parameter values are obtained dynamically at runtime.
