Abstract
With the advancement of technology and the expansion of broadcasting around the globe, biometric surveillance systems have received a further boost. Pattern recognition is the key track in this area. The convolution neural network (CNN), one of the most prevalent deep learning algorithms, has gained a high reputation in image feature extraction. In this paper, we propose a new variant of unsupervised learning, namely convolution sparse filter learning (CSFL), to obtain rich and discriminative features of an image. The features extracted by the CSFL algorithm are used to initialize the first CNN layer, and the CNN then uses them in a feed-forward manner to learn high-level features for classification. A softmax classifier (multinomial logistic regression) serves as the output layer of the CNN and provides the probability of each image class. We present and examine five different CNN architectures, using the mean square error (MSE) as the error function. The experimental results on a public dataset showcase the merit of the proposed method.
Introduction
The convolution neural network (CNN) has a diverse range of applications in visual recognition, e.g. handwritten digit recognition [17,31], character recognition [19], face detection [55], face recognition [12,28], facial expression analysis [54] and car detection [64,66].
Many efforts have been made to demonstrate the superiority of neural networks over manual approaches to visual recognition tasks [1,4,9,11,14,35,51,59,62,65,68,69,71]. The key strength of a neural network is that it can adjust its weights through back-propagation to produce the desired output, and it has therefore proved competent in situations where an analytical solution is difficult to obtain. However, the main problems that restrict neural network applications in real-life tasks are the computational power requirements, the large amount of training data needed and the large memory footprint.
In recent years, many researchers have tried to overcome these issues by providing fast and robust neural network models. Unsupervised pre-training of CNN [23,26,32] is among them. Onis et al. [43] presented convolutional PCA for object detection, reducing the dependency on the training dataset. The authors of [18] proposed a PCA-based CNN composed of several feature extraction stages and one nonlinear output stage. In their CNN architecture, the filter banks of the convolution layers are learned through PCA, and in the nonlinear output stage, binary hashing is applied to solve the detection problem.

Flowchart of the proposed method. The flowchart shows how the supervised method (CNN) and the unsupervised method (CSFL) are combined for the face recognition task.
Li et al. [34] proposed an approach for improving the efficiency of the traditional CNN. They pre-train the convolution kernels through PCA, so that the convolution layers do not depend on any other training algorithm such as back-propagation. Huang et al. [23] proposed an unsupervised learning algorithm for object recognition that is invariant to shifts and distortions. Their system consists of multiple convolution filters, followed by a point-wise sigmoid non-linearity and a feature pooling layer. Similarly, Baccouche et al. [3] handled shift invariance in a deep convolution neural network by adding an extra hidden variable to the objective function. Specifically, a convolutional sparse auto-encoder is trained to obtain this property.
Nair et al. [38] and Hajinoroozi et al. [20] replaced the convolutional filters used in the traditional CNN with restricted Boltzmann machines (RBM), so that the systems can perform more efficient and faster electroencephalography (EEG) based signal classification. However, these systems do not guarantee fast convergence or low computational cost of the CNN training algorithm. Also, the weight updates are not optimized to prevent the system from getting stuck in local minima.
In this paper, we propose a novel unsupervised pre-training approach for CNN [47] to improve a specific recognition task, namely face recognition. For this task, we develop a hybrid system that has the advantage of unsupervised filter learning. The outline of the proposed method is shown in Fig. 1. To capture rich and discriminative information about faces, convolution sparse filter learning (CSFL) is employed to learn the filters of the network from a large number of unlabeled face and non-face images. A softmax classifier is used as the output layer. For a given face or non-face image, the network automatically learns good features to represent the image and outputs the probability of each class, i.e. face or non-face.
The remainder of the paper is structured as follows. Section 2 reviews the related state-of-the-art work. The architecture of the CNN is described in Section 3. The proposed approach is explained in Section 4. Experimental results and a comparison of different algorithms are discussed in Section 5. Finally, concluding remarks and future directions are given in Section 6.
The face is considered to be the most feature-rich part of the human body. Although the face is not a large part of the body, the more than 7 billion human beings in the world can be easily differentiated through their face patterns. This shows the importance of the face relative to other body parts: it contains rich and discriminative features, which need to be extracted and learned to provide useful categorization. Face detection is therefore a hot topic, owing to its wide range of applications from daily life to industry, e.g. security access mechanisms, biometric identification, network-based video coding, content-related video indexing and advanced human-computer communication [8,45–47,50,57,60,70]. For this reason, our proposed CNN approach uses a face and skin detection dataset.
Many recent works have been proposed to boost the efficiency of the face recognition task. These systems extract features through filter banks (usually based on oriented edge detectors) and non-linear operations (normalization, quantization, sparsification, winner-take-all, point-wise saturation). Numerous recognition systems use a feature extraction phase followed by a supervised classifier. Specific examples of feature extraction stages in such systems include HoG [10], SIFT features [29,36], Geometric Blur [7], and models inspired by the mammalian visual cortex [49].
Other models use such feature extractors in two or more consecutive stages, followed by a supervised classifier. These include CNNs trained in a purely supervised manner with gradient descent [24], with an auxiliary task [2], or in a purely unsupervised manner [23,26,32]. Multi-stage systems also include HMAX-type architectures [37,58], in which the first layer is hardwired with Gabor filters and the second layer is trained in an unsupervised manner by storing randomly selected output features from the first stage as filters of the next stage.
All the aforementioned models differ from each other mainly in the number of feature extraction stages (one, two or more), the selection of the filters (supervised, unsupervised or hardwired), the kind of non-linearity used after the filter banks, and the top-level classifier (linear, non-linear or more sophisticated).
Therefore, in this paper we propose a hybrid architecture for CNN, inspired by unsupervised pre-training [6,22,25,27,63]. Combining both approaches (supervised and unsupervised) is expected to give more refined results, since it mirrors a natural phenomenon (humans use both supervised and unsupervised learning). The key advantage of the proposed unsupervised pre-filter learning approach, CSFL, is that it uses the sparse function
In comparison with the existing literature on face recognition, the key research questions of this paper are: first, to determine the superiority of a deeper CNN over a shallow CNN; second, to develop a robust unsupervised pre-filter learning algorithm for CNN. In this regard, we make the following major contributions. First, we develop different CNN architectures to investigate the role of each layer in the face classification task. Then, we develop a new learning algorithm called convolution sparse filter learning (CSFL) for the face classification problem, which is able to capture rich and discriminative features and helps the training algorithm in weight optimization. It also converges quickly and gives an optimized solution, because it easily learns feature representations that are well suited to a range of tasks, including object classification. Finally, experimental results show that the proposed algorithm outperforms previous benchmark algorithms and improves classification efficiency by up to 2%.
Face detection is also an obligatory initial stage of face recognition and facial expression analysis, which opens the gate to enriching, highlighting, comprehending and improving human-computer interaction.

Convolution neural network design. Cnv1 is the first convolution layer and SS2 the second layer, followed by Cnv3 and SS4, which serve as the third and fourth layers of the architecture respectively. The last two layers are the final convolution layer (Cnv5) and the output layer (FC6) of the system.
LeCun et al. [30] initially proposed the architecture of CNN in the late eighties, and he and his colleagues further revised it in the nineties. CNN is a biologically inspired neural network system based on three significant architectural concepts: weight sharing, local receptive fields, and spatial or temporal subsampling. The main idea behind the CNN design was the recognition of two-dimensional visual patterns. The key strengths of CNN are: (1) feature extraction and classification are combined into one structure and are fully adaptive, (2) the network extracts 2-D image features in a feed-forward manner, and (3) it is comparatively invariant to geometric and local distortions in the image.
Figure 2 shows the proposed CNN architecture. It comprises two stages, unsupervised CSFL and supervised learning with a convolution neural network [47], which produce low-level local features and high-level global features respectively. The high-level features provide general descriptions of whole faces, while the low-level features aim to describe different patches of the face specifically. The architecture works in a feed-forward manner and incorporates the features learned in both the unsupervised and supervised learning phases for classification, in order to take full advantage of both low-level and high-level features.
Depth is a very important factor in CNN design, as the computation and memory requirements of a CNN increase greatly with the number of layers. We therefore choose an architecture of six layers to avoid high computation and to ease real-time implementation. The six layers consist of three convolutional layers, “Cnv1”, “Cnv3” and “Cnv5” (layers 1, 3 and 5 respectively), two sub-sampling layers, “SS2” and “SS4” (layers 2 and 4 respectively), and a single fully connected output layer, “FC6”. In the convolution layers, CSFL provides effective filters for the system. The subsampling layers use an average pooling operator to reduce the spatial resolution of their input. Feature extraction and classification are merged into a single network that is fully functional. Also, 2-D image features are extracted at increasing scales in a feed-forward manner. Moreover, the network is comparatively robust to local distortions and small shifts in the image [47]. A detailed explanation of each CNN layer is presented below.
Convolution layer
In image processing applications, convolution layers provide a non-linear mapping from a low-level image representation to a high-level knowledge representation, simulating the “simple cells” in the typical model of the visual cortex [16]. This is why convolution layers are considered to be the backbone of CNN.
Each plane in a convolution layer is connected to one or several feature maps of the previous layer, depending on the architecture of the CNN. Each connection is associated with a convolution mask, which is a 2-D matrix of adaptable weights that are adjusted according to the desired output. In each plane, the convolution between its 2-D inputs and the convolution masks is computed. The convolution outputs are summed and an adaptable scalar term, called the bias, is added. Finally, the output of the plane is obtained by applying the activation function to the result. The plane output is a 2-D matrix called a feature map, because each convolution output indicates the presence of a visual feature at a given pixel position [47]. This can be mathematically expressed as:
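As a rough illustration of the plane computation described above, the following plain-Python sketch sums the convolutions of the input feature maps with their masks, adds the bias and applies an activation function. It is not the authors' implementation: the function names `conv2d_valid` and `conv_plane`, the "valid" border handling and the tanh activation are assumptions made here for illustration only.

```python
import math


def conv2d_valid(image, mask):
    """'Valid' 2-D convolution of a list-of-lists image with a 2-D mask."""
    mh, mw = len(mask), len(mask[0])
    out_h = len(image) - mh + 1
    out_w = len(image[0]) - mw + 1
    # Flip the mask in both directions; this is what distinguishes
    # convolution from cross-correlation.
    flipped = [row[::-1] for row in mask[::-1]]
    return [
        [
            sum(
                image[r + i][c + j] * flipped[i][j]
                for i in range(mh)
                for j in range(mw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv_plane(input_maps, masks, bias, activation=math.tanh):
    """Feature map of one plane: activation(sum_k conv(x_k, mask_k) + bias).

    The tanh default is an assumption; any sigmoidal function fits the text.
    """
    partial = [conv2d_valid(x, m) for x, m in zip(input_maps, masks)]
    rows, cols = len(partial[0]), len(partial[0][0])
    return [
        [activation(sum(p[r][c] for p in partial) + bias) for c in range(cols)]
        for r in range(rows)
    ]
```

A plane connected to several feature maps simply receives one mask per input map; the masks and the bias are the quantities that training adjusts.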
Moreover, to make feature extraction easier for the convolutional layer, we propose a new variant of unsupervised feature extraction. We demonstrate this by taking the element-wise sigmoid function
Subsampling layer
The key goal of the subsampling layer is to make the representation robust to both geometric distortions and small variations. A sub-sampling layer has the same number of planes as the preceding convolution layer. Each subsampling plane divides the 2-D input it receives from the preceding convolution layer into non-overlapping blocks of four pixels. The sum of each block is computed, multiplied by an adaptable weight and added to a scalar term called the bias. The result is passed through an activation function to yield an output of
Suppose that l represents a subsampling layer, and l belongs to an even integer i.e.
For a given subsampling layer l, its desired feature map m can be obtained as
In this paper, we use
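The subsampling step described in this subsection can be sketched as follows. This is a minimal sketch under stated assumptions: the blocks are taken to be 2×2, the output is assumed to have the common form activation(weight × block sum + bias), and the name `subsample_plane` and the tanh default are hypothetical choices, not details from the paper.

```python
import math


def subsample_plane(feature_map, weight, bias, activation=math.tanh):
    """Sum each non-overlapping 2x2 block, scale, shift, then activate."""
    rows, cols = len(feature_map) // 2, len(feature_map[0]) // 2
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Sum of the four pixels in the block with top-left (2r, 2c).
            block_sum = (
                feature_map[2 * r][2 * c]
                + feature_map[2 * r][2 * c + 1]
                + feature_map[2 * r + 1][2 * c]
                + feature_map[2 * r + 1][2 * c + 1]
            )
            out_row.append(activation(weight * block_sum + bias))
        out.append(out_row)
    return out
```

With `weight = 0.25`, `bias = 0` and an identity activation, this reduces to plain average pooling over 2×2 blocks, which halves the spatial resolution in each direction.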
In the last convolution layer, each plane is connected to exactly one feature map of the preceding layer. The input feature map has exactly the same size as the convolution mask in that layer, so every plane in the last convolution layer produces a single scalar output. The outputs of all planes in this layer are then connected to the output layer.
Output layer
Sigmoidal neurons or radial-basis-function neurons usually form the output layer [21]. Here, we focus on using sigmoidal neurons for the output layer. The outputs of this layer serve as the network outputs. In applications such as visual pattern classification, these outputs indicate the class of the input image. Let L be the output layer containing sigmoidal neurons and
Proposed approach
Two types of features are learned in the proposed unsupervised CNN architecture. First, low-level features are learned through CSFL; then high-level features are learned and classified by the CNN. The efficiency of a CNN depends heavily on the training algorithm, which in turn depends on the amount of input data. Therefore, the key role of the proposed CSFL approach is to help the CNN training algorithm in weight optimization, to avoid premature convergence and to reduce the dependency on input data.
Convolution sparse filter learning (CSFL)
To the best of our knowledge, this is the first effort in which an unsupervised neural network has been introduced to the face classification problem. The key advantage of sparse filtering is that it is hyper-parameter free, so it works well on a variety of data modalities without specific tuning for each modality. This allows easy learning of feature representations that are well suited to a range of tasks, including object classification. Table 1 compares the hyper-parameters of the proposed filter learning method, CSFL, with those of other unsupervised learning methods, i.e. sparse filtering, ICA, sparse coding, sparse auto-encoders and sparse RBMs.
Comparison of the proposed filter learning method with different feature learning algorithms in terms of adjustable hyper-parameters
Table 1 shows the advantage of the proposed filter learning method over other unsupervised learning algorithms, which stems from its limited number of hyper-parameters. For the other unsupervised learning algorithms, as the number of features increases, it takes significantly longer to solve the L1-regularized least squares problem for finding the coefficients [40]. Similarly, Fig. 3 shows that sparse filtering is considerably faster (4×) than other unsupervised learning algorithms, i.e. ICA, sparse coding and sparse auto-encoders, for large input dimensions [40]. The proposed CSFL therefore helps the CNN algorithm reach a fast and optimized solution compared with other unsupervised learning methods.

Convergence time comparison between different unsupervised learning algorithms for different input dimensions.
This subsection presents the implementation of the CSFL algorithm introduced for CNN. The whole implementation process is summarized in Algorithm 1, which shows that the proposed methodology differs from the other unsupervised learning algorithms in the existing literature.

Convolution sparse filter learning
Define a data matrix
We formulate the convolution sparse filter learning algorithm mathematically as:
Precisely, first each feature is normalized to be equally active by dividing its
The implementation of convolution sparse filter learning is straightforward, since it is hyper-parameter free. The optimization of (10) is performed with the L-BFGS method [41]. Because the objective function contains absolute value operators, which are non-differentiable, an approximation is introduced. The gradient of the objective function is computed by back-propagation, with the absolute value operators replaced by this approximation. Different filter of
Algorithm 1 summarizes the complete optimization process of the proposed CSFL. “Convergence” is reached when the change in the objective function value in Eq. (9) falls below a threshold, or when the number of iterations exceeds another threshold.
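To make the normalization steps mentioned above concrete, the sketch below evaluates the objective of standard sparse filtering as formulated in [40]: feature-wise L2 normalization, example-wise L2 normalization, then an L1 penalty. It is an assumption that the CSFL objective has this form; the exact equation, the convolutional patch extraction and the L-BFGS loop are not reproduced here, and the name `sparse_filtering_objective`, the soft-absolute approximation sqrt(z² + eps) and the value of `eps` are illustrative choices.

```python
import math


def sparse_filtering_objective(weights, data, eps=1e-8):
    """Standard sparse filtering objective (assumed form, see lead-in).

    weights: list of filters, each a list of length n_inputs.
    data:    list of examples, each a list of length n_inputs.
    """
    # Soft-absolute responses: f[j][i] = sqrt((w_j . x_i)^2 + eps), a smooth
    # stand-in for |w_j . x_i| so that the objective is differentiable.
    feats = [
        [math.sqrt(sum(w * x for w, x in zip(filt, ex)) ** 2 + eps) for ex in data]
        for filt in weights
    ]
    # Step 1: divide each feature (row) by its L2 norm across all examples,
    # so that every feature is equally active.
    feats = [
        [v / math.sqrt(sum(u * u for u in row)) for v in row] for row in feats
    ]
    # Step 2: divide each example (column) by its L2 norm across features.
    n_examples = len(data)
    col_norms = [
        math.sqrt(sum(row[i] ** 2 for row in feats)) for i in range(n_examples)
    ]
    # Step 3: L1 penalty on the normalized features (all entries positive).
    return sum(row[i] / col_norms[i] for row in feats for i in range(n_examples))
```

Filters that respond to only a few examples give a lower objective than filters that respond equally to everything, which is the sparsity pressure the optimizer exploits.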
Suppose the training set has U input data matrix and U desired output representations. Suppose
Face window size of
First, we developed an unsupervised CNN in which data pass through many intermediate processing layers between the input and the output in a feed-forward manner. The network is fed with some extra information about the face shape, which helps the unsupervised CNN to learn more information and to reduce the number of false alarms that are certain to appear when the network focuses only on the central part of the face. Second, coarse manual labeling of facial points produces substantial errors during cropping, which in turn affect the training process of the network. Because our proposed architecture works in a feed-forward manner, small errors in manual labeling reinforce this ability by providing examples that are not precisely aligned. This shows that great care is required when gathering and aligning training examples of face patterns.
Our unsupervised CNN is trained with the RPROP (resilient back-propagation) algorithm [53], an efficient learning method that adapts the weight step based on local gradient information. The main reason for using this training method in our unsupervised CNN is that the weight adaptation is not blurred by gradient behavior at all. Let
The weights are adjusted in each iteration (epoch) to bring the actual output closer to the desired output. In each iteration, the algorithm jumps to local minima and the update value
If the derivative is positive, i.e. the error is increasing, the weight is decreased by its update-value. Conversely, if the derivative is negative, the weight is increased by its update-value, i.e. the update-value is added. This can be mathematically illustrated as:
However, if the minimum was missed because the previous step was too large, the previous weight update is reverted. Mathematically:
Because of this backtracking weight step, the derivative is expected to change its sign once more in the following step. To avoid this computational expense of the update-value, there should be no adaptation of the update-value in the succeeding step, which is achieved by setting
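The update rules described in this subsection can be sketched in plain Python as one RPROP iteration with weight backtracking. This is a sketch rather than the authors' implementation: the function name `rprop_step` is hypothetical, and the increase and decrease factors (1.2 and 0.5) and the step bounds (1e-6 and 50) are the commonly used RPROP defaults, assumed here because the paper's own settings are not given in the text.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps, prev_updates,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One in-place RPROP iteration with weight backtracking.

    All arguments are lists of equal length; `steps` holds the per-weight
    update-values and `prev_updates` the last applied weight changes.
    """
    for i, grad in enumerate(grads):
        change = grad * prev_grads[i]
        if change > 0:
            # Same sign as last time: grow the update-value and step
            # against the sign of the derivative.
            steps[i] = min(steps[i] * eta_plus, step_max)
            prev_updates[i] = -sign(grad) * steps[i]
            weights[i] += prev_updates[i]
            prev_grads[i] = grad
        elif change < 0:
            # Sign flipped: the last step jumped over a minimum, so shrink
            # the update-value and revert the previous weight update.
            steps[i] = max(steps[i] * eta_minus, step_min)
            weights[i] -= prev_updates[i]
            # Store a zero derivative so the next iteration does not adapt
            # the update-value a second time.
            prev_grads[i] = 0.0
        else:
            # No sign information (first step or just after a revert).
            prev_updates[i] = -sign(grad) * steps[i]
            weights[i] += prev_updates[i]
            prev_grads[i] = grad
```

Only the sign of each derivative is used, never its magnitude, so the size of a weight change is governed entirely by the per-weight update-value.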
Experimental results and discussion
The proposed unsupervised CNN is evaluated on the face and skin detection dataset [48]. A broad set of experiments is performed to assess its face classification performance on this dataset and to compare it with other classification methods. Before turning to the analysis and evaluation, we first explain how the dataset was prepared.
Dataset
The dataset used in our experiments is a complex and challenging one taken from the face and skin detection dataset [48]. It contains 4000 color images that vary in background scene, illumination conditions, and face and skin types. The illumination conditions include indoor and outdoor lighting: of the 4000 images, 1931 were taken under indoor lighting, 1855 under outdoor lighting, and the remaining 214 under other illumination conditions. The skin types include yellowish, whitish, darkish and brownish skin. Across the whole dataset, 1665 images show whitish and pink skin, 1402 show yellowish and light brownish skin, 965 show reddish, darkish and dark brownish skin, and 102 show other skin types.
We build a training dataset of 2000 images by taking one thousand face and one thousand non-face images from [48], as shown in Fig. 4, and fixing the size of each image to

Face and non-face outlines of training sample.
In this sub-section, we describe the performance evaluation of the proposed CNN architecture on the face and skin detection dataset. For experimental purposes, we evaluate two main components of the CNN, i.e. depth and filter learning.
To investigate the contribution of the system depth and of filter learning to performance, we compare five different kinds of CNN architectures. The aim of implementing these different architectures is to assess the superiority of the proposed architecture over state-of-the-art algorithms. The four comparison architectures and the proposed architecture are described as follows:
Training algorithm RPROP with different weight initialization parameters
Mean square error (MSE) obtained at different training epochs by using several filter learning methods on face dataset [48]
Comparison results at different training epochs of various methods on the dataset in [48]
CNN-RN-4 is a standard 4-layer CNN architecture for face classification, in which the first convolution layer is initialized randomly from a standard normal distribution with zero mean and unit variance.
CNN-RN-6 [47] is a standard 6-layer CNN architecture for face classification, in which the first convolution layer is initialized randomly from a standard normal distribution with zero mean and unit variance.
CNN-SUD-6 is a standard 6-layer CNN architecture for face classification, in which the first convolution layer is initialized randomly from a standard uniform distribution.
CNN-sparse filtering [52] is based on the standard 6-layer CNN architecture, in which the first convolution layer is convolved with 2-D filters learned through sparse filtering.
The proposed model is based on the standard 6-layer CNN architecture. CSFL performs the initial filter learning, and the learned filters are then used by the CNN for higher-level feature learning and classification.
All these CNN architectures are trained with the same training algorithm, RPROP. Table 2 summarizes the weight initialization methods used with RPROP for comparison. Each network is trained for 2000 epochs. To measure the performance of each network, we use ten-fold cross-validation and compare the training MSE, the number of training epochs and the training time.
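The evaluation protocol described above can be sketched as follows. This is a minimal sketch, assuming a shuffled, equally sized split and a plain mean of squared errors; the function names `k_fold_indices` and `mean_squared_error`, the fixed seed and the absence of any scaling factor in the MSE are assumptions made here, since the paper does not specify them.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def mean_squared_error(targets, outputs):
    """Mean of squared differences between desired and actual outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)
```

For the 2000-image training set, each of the ten folds holds out 200 images for testing and trains on the remaining 1800, so every image is tested exactly once.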

Table 3 and Table 4 report the MSE obtained at different training epochs with several filter learning methods. It is clear that when the CNN is initialized with the proposed model, the RPROP algorithm reaches a lower MSE more quickly than with all other initialization methods. For example, to reach an MSE of 0.12, RPROP takes only 100 epochs with the proposed algorithm, whereas it needs 350 and 310 epochs with the approaches of [47] and [52] respectively. Figure 5 shows that the proposed CSFL outperforms all other filter learning models in terms of lower MSE. Hence we conclude that the filters obtained through CSFL escape from local minima better than those of the other standard methods. Furthermore, Table 5 and Table 6 show the classification accuracy of all the models used in the experiments. The CNN-RN-6 and CNN-SUD-6 networks have similar classification rates of 98.30% and 97.90% respectively on the training dataset, and 94.83% and 93.16% on the testing dataset (differences of 0.40 and 1.67 percentage points respectively). The CNN-RN-4 network achieves the lowest classification rates (95.35% and 91.66%). The highest classification rates, 99.52% and 97.78% on the training and testing datasets respectively, are achieved by the proposed method. These relative performances are consistent with the training speed comparison discussed earlier.
Classification accuracy versus different filter learning methods on face and skin dataset [48]
Classification accuracy of different stochastic methods on the dataset in [48]

Comparison on the standard face and skin detection dataset of training algorithm RPROP with different weight initialization. (a) Training MSE versus the number of training epochs, and (b) training MSE versus the training time.
When the CNN is initialized with a non-optimal filter learning approach, such as random initialization from a standard normal or standard uniform distribution, more training time is required to find a better solution, as shown in Fig. 6. Figure 6(a) shows that at any given epoch, the RPROP algorithm achieves the smallest MSE with the proposed model, followed by the sparse filtering model. In contrast, weights initialized randomly from a standard normal distribution give the largest MSE, followed by the standard uniform distribution. In terms of training time, Fig. 6(b) shows that RPROP is slower with weights initialized randomly from a standard normal distribution than with all other initializations, whereas it converges faster with the proposed learning algorithm and the sparse filtering method than with all other methods.
It is clear from the experimental results that the proposed model achieves better classification performance than the other state-of-the-art methods. The highest accuracy obtained by the proposed architecture demonstrates the ability of CSFL to capture effective and distinguishable features from unlabeled training images for the face classification task. Training plays a significant role in the development and efficiency of a CNN. Among the learning algorithms proposed for CNN, the contributions of the proposed learning algorithm (CSFL) are as follows:
It allows the CNN to converge faster and to require less memory in training than the other learning algorithms.
It allows the training algorithm of the CNN to optimize its weights more swiftly and precisely to find a good solution.
A novel architecture for face image classification has been presented, in which CSFL is introduced to learn the filter bank in order to capture effective and distinguishable features. The major contributions of the proposed method are: (1) it provides a good initialization of the CNN weights, which helps to prevent the system from getting stuck in a local minimum; (2) it speeds up the CNN by providing robust initial filter learning; (3) it lessens the dependency of the CNN on a large amount of training data by treating the data in an unsupervised manner. We expect this novel model to serve as a baseline for future research, and we are confident that these results will accelerate further research on unsupervised CNN.
In future work, we would like to extend our system to further improve visual classification tasks by incorporating multi-resolution and colour information. We also suggest that a more robust classifier with multitask learning could help the system achieve better efficiency.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (No. U1405254, U1536115, U1536207, 61671030, 61271392), the Excellent Talents Foundation of Beijing, and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT & TCD201404052). The authors wish to thank anonymous reviewers for their valuable comments and suggestions that improved this paper.
