CSFL: A novel unsupervised convolution neural network approach for visual pattern classification

Abstract

Advances in technology and the global expansion of broadcasting have further boosted biometric surveillance systems, and pattern recognition is a key track in this area. The convolution neural network (CNN), one of the most prevalent deep learning algorithms, has gained a high reputation in image feature extraction. In this paper, we propose a new twist on unsupervised learning, i.e. convolution sparse filter learning (CSFL), to obtain rich and discriminative features of an image. The features extracted by the CSFL algorithm are used to initialize the first CNN layer, and the CNN then uses these features in a feed-forward manner to learn high-level features for classification. The linear regression classifier (softmax classifier) serves as the output layer of the CNN and provides the probability of an image class. We present and examine five different CNN architectures, using the mean square error (MSE) as the error function. The experimental results on a public dataset showcase the merit of the proposed method.

Keywords

Convolution neural network classification unsupervised learning feature extraction

1. Introduction

The convolution neural network (CNN) has a diverse range of applications in the field of visual recognition, e.g. handwritten digit recognition [17,31], character recognition [19], face detection [55], face recognition [12,28], analysis of facial expressions [54] and car detection [64,66].

Many efforts have been made to establish the superiority of neural networks over manual approaches in visual recognition tasks [1,4,9,11,14,35,51,59,62,65,68,69,71]. The key strength of neural networks is that they are able to adjust their weights through back-propagation in order to produce the desired output, and they have hence proved competent in situations where an analytic solution is difficult to attain. However, the main problems that restrict neural network applications in real-life tasks are the computational power requirements, the large amount of training data needed and the large memory dependency.

In recent years, many researchers have tried to overcome these issues by providing swift and robust models for neural networks. Unsupervised pre-training of CNN [23,26,32] is among them. Onis et al. [43] presented convolutional PCA for object detection, restricting the dependency on the training dataset. The authors of [18] proposed a PCA-based CNN composed of numerous feature extraction stages and one nonlinear output stage. In their CNN architecture, the filter banks of the convolution layers were learned through PCA, and in the nonlinear output stage, binary hashing was applied to solve the detection problem.

Fig. 1.

The proposed method flowchart. The flowchart explains the combined supervised method (through CNN) and unsupervised method (CSFL) used for the face recognition task.

Li et al. [34] proposed an approach for improving the efficiency of the traditional CNN. They pre-train the convolution kernels through PCA, which is independent of any other training algorithm, such as back-propagation, in the convolution layers. Huang et al. [23] proposed an unsupervised learning algorithm for object recognition that is invariant to shifts and distortions. Their system consists of multiple convolution filters, followed by a point-wise sigmoid non-linearity and a feature pooling layer. Similarly, Baccouche et al. [3] handled shift-invariance for deep convolution neural networks by adding an extra hidden variable to the objective function. Precisely, a convolutional sparse auto-encoder is trained to obtain this functionality.

Nair et al. [38] and Hajinoroozi et al. [20] replaced the convolutional filters used in the traditional CNN with restricted Boltzmann machines (RBM), so that the systems can perform more efficient and faster electroencephalography (EEG) based signal classification. However, these systems do not guarantee fast convergence or reduced computation for the CNN training algorithm. Also, the weight changes are not optimized to prevent the system from becoming stuck in local minima.

In this paper, we propose a novel unsupervised pre-training approach for CNN [47] to improve a specific recognition task, i.e. face recognition. For this task, we develop a hybrid system that has the advantage of unsupervised filter learning. The outline of the proposed method is shown in Fig. 1. In order to capture rich and discriminative information about faces, convolution sparse filter learning (CSFL) is employed to learn the filters of the network from a large number of unlabeled face and non-face images. The softmax classifier layer is used as the output layer. For a given face/non-face image, the network is able to automatically learn good features to represent the face and to output the probability of each type, i.e. face or non-face.

The remainder of the paper is structured as follows. Section 2 reviews the related state-of-the-art work. The architecture of the CNN is described in Section 3. The proposed approach is explained in Section 4. Experimental results and the comparison of different algorithms are discussed in Section 5. Finally, concluding remarks and future directions are given in Section 6.

2. Related work

The face is considered the most feature-rich part of the human body. Although the face is not a large part of the body, the more than 7 billion human beings in the world can easily be differentiated through their face patterns. This shows the advantage of the face over other body parts: it contains rich and discriminative features, which need to be extracted and learned to provide useful categorization. Face detection is therefore a hot topic, owing to its wide range of applications from daily life to industry, e.g. security access mechanisms, biometric identification, network-based video coding, content-related video indexing and cutting-edge human-computer communication [8,45–47,50,57,60,70]. Accordingly, our proposed CNN approach uses the face and skin detection dataset.

Many recent works have been proposed to boost the efficiency of the face recognition task. These systems extract features through filter banks (usually based on oriented edge detectors) and non-linear procedures (normalization, quantization, sparsification, winner-take-all, point-wise saturation). Numerous recognition systems first use a feature extraction phase followed by a supervised classifier. Specific examples of such systems use HoG [10], SIFT features [29,36], Geometric Blur [7], and models inspired by the mammalian visual cortex [49] in the feature extraction stage.

Other models use such feature extractors in two or more consecutive phases, followed by a supervised classifier. These include CNNs trained in a purely supervised manner with gradient descent [24], or with an auxiliary task [2], or trained in a purely unsupervised manner [23,26,32]. Multi-stage systems also comprise HMAX-type architectures [37,58], in which the first layer is hardwired with Gabor filters and the second layer is trained in an unsupervised manner by storing randomly selected output features from the first phase as filters of the next phase.

All the aforementioned models differ from each other in the number of feature extraction stages, i.e. one, two or more, in the selection of the filters (supervised, unsupervised, hardwired), in the kind of non-linearity used after the filter banks, and in the top-level classifier (linear, non-linear or more sophisticated).

Therefore, in this paper we propose a hybrid architecture for CNN, inspired by unsupervised pre-training [6,22,25,27,63]. Combining both approaches (supervised and unsupervised) is expected to give more refined results, since it mirrors a natural phenomenon (humans use both supervised and unsupervised learning). The key advantage of the proposed unsupervised pre-filter learning approach CSFL is that it uses the sparse function $spf (\cdot)$ to measure the sparsity of representations, which makes it different from traditional sparsity constraints, i.e. the $l_{0}$ , $l_{1}$ or $l_{2}$ norms. During the filter learning process, we adopt the manifold assumption [5] in order to ensure that similar input image patches have similar high-level representations. Thus the learned CSFL filters are able to capture rich and discriminative features of faces and so raise the classification efficiency. Preliminary results of this paper have already been presented in [52].

In comparison to the existing literature on face recognition, the key research questions of this paper are: first, to determine the superiority of a deeper CNN over a shallow CNN; second, to develop a robust unsupervised pre-filter learning algorithm for CNN. In this regard, we make the following major contributions. First, we develop different CNN architectures to investigate the role of each layer in the face classification task. Second, we develop a new learning algorithm called convolution sparse filter learning (CSFL) for the facial classification problem, which is able to capture rich and discriminative features and helps the training algorithm in weight optimization. It also converges swiftly and gives an optimized solution, because it readily learns feature representations that are well matched to a range of tasks, including object classification. Finally, experimental results show that the proposed algorithm outperforms previous benchmark algorithms and improves classification efficiency by up to 2%.

Face detection is also an obligatory initial stage of face recognition and expression analysis, which opens the gate to enriching, understanding and improving human-computer interaction.

Fig. 2.

Convolution neural network design. Cnv1 represents the first convolution layer, SS2 serves as the second layer followed by Cnv3 and SS4 which serves as the 3rd and 4th layer of the architecture respectively. The last two layers are the last convolution layer (Cnv5) and output layer (FC6) of the system.

3. Architecture of convolution neural network

LeCun et al. [30] initially proposed the architecture of CNN in the early eighties, and LeCun and his colleagues further revised it in the nineties. The CNN is a biologically inspired neural network system based on three significant architectural concepts: weight sharing, local receptive fields, and sub-sampling in the spatial or time domain. The main idea behind the CNN design was the recognition of two-dimensional visual patterns. The key strengths of the CNN are: (1) feature extraction and classification are combined into one structure and are completely adaptive, (2) the network extracts 2-D image features in a feed-forward manner, and (3) it is comparatively invariant to geometric and local distortions in the image.

Figure 2 shows the proposed architecture of CNN. Our proposed architecture comprises two stages, i.e. unsupervised CSFL and supervised learning through the convolution neural network [47], which produce low-level local features and high-level global features respectively. The high-level features provide general descriptions of whole faces, whereas the low-level features describe specific patches of the face. The architecture is feed-forward and incorporates the features learned in both the unsupervised and supervised learning phases for categorization, in order to gain the full benefit of both low-level and high-level features.

Depth is a very important factor in CNN design, because the computation and memory requirements of a CNN increase greatly with the number of layers. Therefore, we choose an architecture of six layers to avoid high computation and to ease real-time implementation. The six layers consist of three convolutional layers, “Cnv1”, “Cnv3” and “Cnv5”, which are layers 1, 3 and 5 respectively; two sub-sampling layers, “SS2” and “SS4”, which are layers 2 and 4 respectively; and a single fully connected output layer, “FC6”. In the convolution layers, CSFL supplies effective filters for the system. The sub-sampling layers use an average pooling operator to reduce the spatial resolution of their input. Feature extraction and categorization are merged into a single, fully adaptive network, and 2-D image features are extracted at increasing scales in a feed-forward manner. Moreover, the network is comparatively robust to local distortions and small shifts in the image [47]. Each CNN layer is explained in detail below.

3.1. Convolution layer

In image data processing applications, convolution layers provide a non-linear mapping from a low-level image representation to a high-level knowledge representation, which simulates the “simple cells” in the typical model of the visual cortex [16]. That is why convolution layers are considered the backbone of the CNN.

Each plane in a convolution layer is linked to one or several feature maps of the previous layer, depending on the architecture of the CNN. Each connection is associated with a convolution mask, which is a 2-D matrix of adaptable weights that are adjusted according to the desired output. In each plane, the convolution is computed between its 2-D inputs and the convolution masks. The convolution outputs are summed and then added to an adaptable scalar term called the bias. Finally, the output of the plane is obtained by applying the activation function to the result. The plane output is a 2-D matrix called a feature map, because each convolution output indicates the presence of a visual feature at a given pixel position [47]. This can be mathematically expressed as: $\begin{array}{ll} (1) & F (u, w, b) = z = \{z_{k}\}_{k = 1, 2, \dots, n}, \\ (2) & z_{k} = \tanh (u \otimes w_{k} + b_{k}) . \end{array}$ Here $u \in R^{p \times s \times s}$ is the input data. $w \in R^{n \times p \times m \times m}$ represents the filter set, in which every single filter is $w_{k} \in R^{p \times m \times m}$ . $b \in R^{n}$ represents the bias term of every single filter output. ⊗ is the convolution operator, which works on a single input image and filter. $z \in R^{n \times (s - m + 1) \times (s - m + 1)}$ represents the output, i.e. the set of feature maps extracted by the convolution layer. $\tanh (\cdot)$ represents the hyperbolic tangent function.

Moreover, to ease feature extraction in the convolutional layer, we propose a new variant of unsupervised feature extraction. We use the element-wise sigmoid function $Sig (\cdot)$ as the nonlinear activation function. The output feature map $y_{j}$ is therefore: $\begin{matrix} (3) & y_{j} = Sig (\sum_{a} K_{a j} \otimes U_{a}) . \end{matrix}$ Here ⊗ represents the convolution operation, $U_{a}$ is the $a$th input 2-D feature map and $K_{a j}$ are the learned 2-D filters obtained by CSFL, which is described in Section 4.1.
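The forward pass of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name `conv_layer` is hypothetical, the contributions of the p input channels are assumed to be summed, and explicit loops are used for readability instead of speed.

```python
import numpy as np

def conv_layer(u, w, b):
    """Valid convolution layer following Eqs. (1)-(2).

    u: input data, shape (p, s, s)
    w: filter set, shape (n, p, m, m)
    b: bias terms, shape (n,)
    Returns z with shape (n, s - m + 1, s - m + 1).
    """
    p, s, _ = u.shape
    n, _, m, _ = w.shape
    out = s - m + 1
    z = np.empty((n, out, out))
    for k in range(n):
        # Flip the kernel spatially so the sliding product is a true
        # convolution rather than a cross-correlation.
        wk = w[k, :, ::-1, ::-1]
        for i in range(out):
            for j in range(out):
                # Assumption: the p input channels are summed.
                z[k, i, j] = np.sum(u[:, i:i + m, j:j + m] * wk) + b[k]
    return np.tanh(z)
```

For a single-channel $32 \times 32$ input and a bank of $5 \times 5$ filters, the returned array has $28 \times 28$ entries per filter, in line with the $(s - m + 1)$ output size given above.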

3.2. Subsampling layer

The key goal of the subsampling layer is to make the representation robust to both geometric distortions and minor alterations. A sub-sampling layer has the same number of planes as the preceding convolution layer. Each subsampling plane divides the 2-D input it receives from the preceding convolution layer into non-overlapping blocks of $2 \times 2$ pixels. The four pixels of each block are summed; this sum is multiplied by an adaptable weight and added to a scalar term called the bias. The result is passed through an activation function to yield the output for that $2 \times 2$ region. Consequently, each sub-sampling plane halves the size of its input along both the horizontal and vertical dimensions. A feature map of this layer is linked to one or more planes in the succeeding convolution layer.

Suppose that l denotes a subsampling layer, where l is an even integer, i.e. $l = 2, 4, 6, \dots, 2 c$ , and c is a positive integer. For feature map m in layer l, $w_{m}^{l}$ and $b_{m}^{l}$ denote the weight and the bias term respectively. Feature map m of convolution layer $l - 1$ is divided into non-overlapping blocks of $2 \times 2$ pixels. Let $z_{m}^{l - 1}$ denote the matrix obtained by summing the 4 pixels in each block, which can be expressed as: $\begin{matrix} (4) & z_{m}^{l - 1} (x, y) = y_{m}^{l - 1} (2 x - 1, 2 y - 1) + y_{m}^{l - 1} (2 x - 1, 2 y) + y_{m}^{l - 1} (2 x, 2 y - 1) + y_{m}^{l - 1} (2 x, 2 y) . \end{matrix}$

For a given subsampling layer l, feature map m is obtained as $y_{m}^{l} = f_{l} (z_{m}^{l - 1} \times w_{m}^{l} + b_{m}^{l})$ . The size of a feature map $y_{m}^{l}$ in sub-sampling layer l is $H_{l} \times W_{l}$ , where $W_{l} = W_{l - 1} / 2$ and $H_{l} = H_{l - 1} / 2$ .

In this paper, we use a $2 \times 2$ block size for sub-sampling by default; it can be adjusted to any desired size by changing the sub-sampling rate. Let the input 2-D feature map of the subsampling layer, i.e. the output of the previous layer, be of size $n_{1} \times n_{2}$ , and let $m_{1} \times m_{2}$ be the size of the output 2-D feature map. Then: $\begin{array}{ll} (5) & m_{1} = [n_{1} - filt_{1}] / p_{1} + 1, \\ (6) & m_{2} = [n_{2} - filt_{2}] / p_{2} + 1 . \end{array}$ Here $filt_{1}$ and $filt_{2}$ are the dimensions of the averaging filter, and $p_{1}$ and $p_{2}$ represent the sub-sampling rates in the horizontal and vertical directions respectively.

In the last convolution layer, each plane is connected to exactly one feature map of the previous layer. The convolution mask in this layer has exactly the same size as the input feature map, so each plane produces a single scalar output. The outputs of all planes in this layer are then connected to the output layer.
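The sub-sampling step can be sketched as follows. The sketch assumes that the bias is added after the block sum is multiplied by the weight, as the prose description of this layer states, and that the map height and width are even. The function name `subsample_layer` and the default activation are illustrative choices, not taken from the paper.

```python
import numpy as np

def subsample_layer(y_prev, w, b, f=np.tanh):
    """2x2 non-overlapping sum pooling with one weight and bias per map.

    y_prev: feature maps of the preceding convolution layer, shape (n_maps, H, W)
    w, b:   adaptable weight and bias per feature map, each of shape (n_maps,)
    f:      activation function (tanh is an assumption of this sketch)
    """
    n, H, W = y_prev.shape
    # Eq. (4): sum the four pixels of every 2x2 block.
    z = y_prev.reshape(n, H // 2, 2, W // 2, 2).sum(axis=(2, 4))
    # Weighted block sum plus bias, passed through the activation.
    return f(z * w[:, None, None] + b[:, None, None])
```

Each output map has half the height and half the width of its input, which matches the size relation given above.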

3.3. Output layer

Sigmoidal neurons or radial-basis-function neurons usually form the output layer [21]. Here, we focus on sigmoidal neurons. The output of this layer serves as the network output. In applications such as visual pattern classification, these outputs indicate the class of the input image. Let L be the output layer, which contains sigmoidal neurons, and let $N_{L}$ be the number of output sigmoidal neurons. Let $w_{m, n}^{L}$ be the weight connecting feature map m of the last convolution layer to neuron n of output layer L, and let $b_{n}^{L}$ be the bias of neuron n. The output $y_{n}^{L}$ of sigmoidal neuron n is computed as: $\begin{matrix} (7) & y_{n}^{L} = f^{L} (\sum_{m = 1}^{N_{L - 1}} y_{m}^{L - 1} w_{m, n}^{L} + b_{n}^{L}) . \end{matrix}$ The outputs of all sigmoidal neurons together form the network output: $\begin{matrix} (8) & y = [y_{1}^{L}, y_{2}^{L}, y_{3}^{L}, \dots, y_{N_{L}}^{L}] . \end{matrix}$
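As a small illustration of Eqs. (7) and (8), the output layer reduces to a matrix-vector product followed by an element-wise sigmoid. The names `output_layer`, `W` and `y_prev` are hypothetical, and the logistic sigmoid is assumed as the activation of the output neurons.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(y_prev, W, b):
    """Eqs. (7)-(8): outputs of all sigmoidal neurons.

    y_prev: scalar outputs of the last convolution layer, shape (N_prev,)
    W:      weights w_{m,n}, shape (N_prev, N_L)
    b:      biases b_n, shape (N_L,)
    """
    return sigmoid(y_prev @ W + b)
```

With two output neurons, the two entries of the returned vector can be read as the scores of the face and non-face classes.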

4. Proposed approach

Two types of features are learned in the proposed unsupervised CNN architecture. First, low-level features are learned through CSFL; then high-level features are learned and classified by the CNN. The efficiency of a CNN depends heavily on the training algorithm, which in turn depends on the amount of input data. Therefore, the key role of the proposed CSFL approach is to help the CNN training algorithm in weight optimization, so as to avoid premature convergence and reduce the dependency on input data.

4.1. Convolution sparse filter learning (CSFL)

To the best of our knowledge, this is the first effort in which an unsupervised neural network has been introduced to the face classification problem. The key advantage of sparse filtering is that it is hyper-parameter free, so it works well on a variety of data modalities without specific tuning for each modality. This permits easy learning of feature representations that are well matched to a range of tasks, including object classification. Table 1 compares the hyper-parameters of the proposed filter learning method, CSFL, with those of other unsupervised learning methods, i.e. sparse filtering, ICA, sparse coding, sparse auto-encoders and sparse RBMs.

Table 1
Comparison of the proposed filter learning method with different feature learning algorithms in terms of adjustable hyper-parameters

Algorithm	Adjustable hyper parameter
Proposed method (CSFL)	#Features
Sparse filtering [40]	#Features
ICA [33]	#Features
Sparse Coding [42]	#Features, sparsity penalty, mini-batch size
Sparse Autoencoders [39]	#Features, target activation, weight decay, sparsity penalty
Sparse RBMs [56]	#Features, target activation, weight decay, sparsity penalty, learning rate, momentum

Table 1 shows the advantage of the proposed filter learning method over other unsupervised learning algorithms, owing to its small number of hyper-parameters. For other unsupervised learning algorithms, as the number of features increases, it takes significantly longer to solve the L1-regularized least squares problem for finding the coefficients [40]. Similarly, Fig. 3 shows that sparse filtering is comparatively faster (4×) than other unsupervised learning algorithms, i.e. ICA, sparse coding and sparse auto-encoders, for large input dimensions [40]. The proposed CSFL thus helps the CNN algorithm reach a fast and optimized solution compared to other unsupervised learning methods.

Fig. 3.

Convergence time comparison between different unsupervised learning algorithms over different input dimensions.

4.1.1. Mathematical model of CSFL

This subsection presents the implementation of the CSFL algorithm introduced for CNN. The whole implementation process is summarized in Algorithm 1, which shows that the proposed methodology is completely different from other unsupervised learning algorithms in the existing literature.

Algorithm 1

Convolution sparse filter learning

Define a data matrix $U = [u_{1}, u_{2}, u_{3}, \dots, u_{n}] \in R^{d \times n}$ , whose columns represent the data points. The key objective is to learn the filter bank $K \in R^{d \times t}$ , which consists of t filters. The nonlinear map function is: $\begin{matrix} (9) & g = Sig (K^{t r} U) . \end{matrix}$ Here $K^{t r}$ is the transpose of K, and $g \in R^{t \times n}$ represents a feature distribution matrix over U. $Sig (\cdot)$ represents the component-wise sigmoid function, which is normally used as a neural network activation function. The element $g_{x, y}$ is the value of the xth feature on the yth example. This formulation avoids explicit modeling of the data distribution, which keeps the formulation simple and permits efficient learning for the visual recognition task.

We formulate the convolution sparse filter learning algorithm mathematically as: $\begin{array}{ll} (A) & E_{CSFL} = ‖ U - \sum_{i} K_{i} \otimes z_{k} ‖_{2}^{2} + spf (g) + ‖ g^{*} - Sig (K^{t r} U) ‖_{2}^{2} + ‖ z^{*} - f (u; w, b) ‖_{2}^{2}, \\ (B) & \underset{K}{\min} ‖ g^{*} - Sig (K^{t r} U) ‖_{2}^{2} . \end{array}$ Here $K \in R^{d \times t}$ is the filter bank consisting of t filters, U is the input data matrix and $g^{*}$ is the optimum sparse representation of g. The $spf (\cdot)$ term in Eq. (A) is optimized through [55]. Let $g (x, Δ) \in R^{1 \times n}$ ( $x = 1, 2, 3, \dots, t$ ) denote the xth row of g and $g (Δ, y) \in R^{t \times 1}$ ( $y = 1, 2, 3, \dots, n$ ) denote the yth column of g. The $spf (\cdot)$ function is calculated by first normalizing the feature distribution matrix by rows, then normalizing it by columns, and finally summing the absolute values of all entries.

Precisely, each feature is first normalized to be equally active by dividing it by its $l_{2}$ norm across all examples, i.e. $g (x, Δ) = g (x, Δ) / ‖ g (x, Δ) ‖_{2}$ . Secondly, each column is divided by its $l_{2}$ norm across all features, so that it lies on the unit $l_{2}$ -ball, i.e. $g (Δ, y) = g (Δ, y) / ‖ g (Δ, y) ‖_{2}$ . In the last step, the absolute values of all entries in g are summed, which acts as an $l_{1}$ penalty. Hence the sparse filtering function can be written as: $\begin{matrix} (10) & \min ‖ g ‖ = \sum_{x = 1}^{t} \sum_{y = 1}^{n} | g (x, y) | = \sum_{y = 1}^{n} ‖ g (Δ, y) ‖_{1} = \sum_{y = 1}^{n} ‖ \frac{g (Δ, y)}{‖ g (Δ, y) ‖_{2}} ‖_{1} . \end{matrix}$

Implementation of convolution sparse filter learning is easy, since it is hyper-parameter free. The optimization of (10) is achieved through the L-BFGS method [41]. An approximation is introduced because the objective function contains absolute value operators, which are non-differentiable. The gradient of the objective function is then calculated by back-propagation, with the absolute value operators replaced by this approximation. Filters of $5 \times 5$ pixels and $3 \times 3$ pixels are used for the Cnv1 and Cnv3 layers respectively, and the filter sizes for the Cnv5 and FC6 layers are calculated automatically for the different phases of the CNN. CSFL learns the filters from numerous $9 \times 9$ patches that are randomly extracted from the input.

Algorithm 1 summarizes the complete optimization process of the proposed CSFL. “Convergence” is reached when the change in the objective function value in Eq. (9) falls below a threshold or the number of iterations exceeds another threshold.
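The three steps just described (row normalization, column normalization, sum of absolute values) translate directly into NumPy. The sketch below only evaluates the sparsity term for a given filter bank and does not perform the L-BFGS optimization. The function name `spf_objective` and the small constant `eps`, added to avoid division by zero, are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def spf_objective(K, U, eps=1e-8):
    """Sparsity value of Eq. (10) for filter bank K and data U.

    K: filter bank, shape (d, t)
    U: data matrix with one example per column, shape (d, n)
    """
    g = sigmoid(K.T @ U)  # Eq. (9): feature distribution, shape (t, n)
    # Step 1: normalize each row (feature) by its l2 norm across examples.
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
    # Step 2: normalize each column (example) by its l2 norm across features.
    g = g / (np.linalg.norm(g, axis=0, keepdims=True) + eps)
    # Step 3: sum of absolute values of all entries (l1 penalty).
    return np.abs(g).sum()
```

Because every column has unit $l_{2}$ norm after the second step, its $l_{1}$ norm lies between 1 and $\sqrt{t}$ ; a lower total therefore indicates that each example activates fewer features.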
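To make the filter sizes concrete, the following sketch traces the feature map side length through the six-layer architecture. It assumes the $32 \times 32$ input window used in the experiments, valid convolutions with output size $s - m + 1$ , $2 \times 2$ sub-sampling following Eqs. (5) and (6), and a Cnv5 mask equal to its input map size so that each plane yields one scalar. The function names are hypothetical.

```python
def conv_out(s, m):
    """Side length after a valid convolution with an m x m mask."""
    return s - m + 1

def pool_out(n, filt=2, p=2):
    """Side length after sub-sampling, following Eqs. (5)-(6)."""
    return (n - filt) // p + 1

def trace_shapes(s=32):
    """Feature map side length after each layer for an s x s input."""
    shapes = {"input": s}
    shapes["Cnv1"] = conv_out(shapes["input"], 5)  # 5x5 filters
    shapes["SS2"] = pool_out(shapes["Cnv1"])
    shapes["Cnv3"] = conv_out(shapes["SS2"], 3)    # 3x3 filters
    shapes["SS4"] = pool_out(shapes["Cnv3"])
    # Cnv5 mask matches its input map, so each plane gives one scalar.
    shapes["Cnv5"] = conv_out(shapes["SS4"], shapes["SS4"])
    return shapes
```

Under these assumptions the side lengths are 32, 28, 14, 12, 6 and 1, so the automatically calculated Cnv5 mask would be $6 \times 6$ pixels.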

4.2. Training CNN

Suppose the training set contains U input images and U corresponding desired outputs. Let $x^{u}$ denote the uth training image and $d^{u}$ the corresponding desired output vector, and let $y_{n}^{u}$ denote the actual output of network output neuron n for that image. The output is a function of all network parameters, i.e. weights and biases. The error function of the CNN can be expressed as: $\begin{matrix} (11) & Error (w) = \frac{1}{U \times N_{L}} \sum_{u = 1}^{U} \sum_{n = 1}^{N_{L}} {(y_{n}^{u} - d_{n}^{u})}^{2} . \end{matrix}$

A face window size of $20 \times 20$ pixels has been reported by many researchers [13,15,44,55,61,67] as the smallest window that can be used in face detection without losing key evidence. However, a $20 \times 20$ window contains only the very central part of the face, and a CNN has many layers of processing. Hence, to avoid information loss, we choose a window size of $32 \times 32$ , selected by trial and error after comprehensive experiments on a wealth of case studies. This face window size proved beneficial for our proposed architecture, for two reasons.

First, in the unsupervised CNN we developed, data passes through many intermediate processing layers between the input and output in a feed-forward manner. The larger window feeds the network with extra information about the face shape, which helps the unsupervised CNN to learn more information and reduces the number of false alarms that are bound to appear when focusing only on the central part of the face. Second, coarse manual labeling of facial points produces substantial errors during cropping, which affect the training process within the network. Because the proposed architecture works in a feed-forward way, small errors in manual labeling provide training examples that are not precisely aligned, which reinforces its robustness to such misalignment. Great care is therefore required over the alignment of face patterns when gathering training examples.
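Eq. (11) is the squared error averaged over all training images and all output neurons. A minimal NumPy version, with the hypothetical name `mse_error`, is:

```python
import numpy as np

def mse_error(y, d):
    """Eq. (11): mean square error over U images and N_L output neurons.

    y: actual network outputs, shape (U, N_L)
    d: desired outputs, shape (U, N_L)
    """
    U, N_L = y.shape
    return np.sum((y - d) ** 2) / (U * N_L)
```

The MSE values reported per training epoch in the experiments are of this form.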

Our unsupervised CNN is trained with the RPROP back-propagation algorithm [53], an efficient learning method that adapts the weight step based on local gradient information. The key reason for using this training method in our unsupervised CNN is that the weight adaptation is not blurred by gradient behavior. Let ${del}_{x y}$ be the individual update value of each weight, which determines the size of the weight increase or decrease. It is determined solely by the local behavior of the error function E, according to the following learning rule: $\begin{matrix} (12) & {del}_{x y}^{t} = \begin{cases} ρ^{+} \times {del}_{x y}^{t - 1} & \text{if } \frac{\partial E^{t - 1}}{\partial w_{x y}} \times \frac{\partial E^{t}}{\partial w_{x y}} > 0, \\ ρ^{-} \times {del}_{x y}^{t - 1} & \text{if } \frac{\partial E^{t - 1}}{\partial w_{x y}} \times \frac{\partial E^{t}}{\partial w_{x y}} < 0, \\ {del}_{x y}^{t - 1} & \text{otherwise} . \end{cases} \end{matrix}$ Here $0 < ρ^{-} < 1 < ρ^{+}$ .

The weights are adjusted in each iteration/epoch to bring the actual output closer to the desired output. Whenever the partial derivative with respect to the corresponding weight $w_{x y}$ changes sign, indicating that the last update was too big and the algorithm has jumped over a local minimum, the update value ${del}_{x y}$ is decreased by the factor $ρ^{-}$ . If the derivative keeps its sign, the update value is slightly increased so as to speed up convergence. Once the update value of each weight has been adapted, the weight itself is updated according to the following rule.

If the derivative is positive, i.e. the error is growing, the weight is decreased by its update value. Conversely, if the derivative is negative, the weight is increased by its update value. This can be expressed as: $\begin{array}{ll} (13) & {delw}_{x y}^{t} = \begin{cases} - {del}_{x y}^{t} & \text{if } \frac{\partial E^{t - 1}}{\partial w_{x y}} > 0, \\ + {del}_{x y}^{t} & \text{if } \frac{\partial E^{t - 1}}{\partial w_{x y}} < 0, \\ 0 & \text{otherwise}, \end{cases} \\ (14) & w_{x y}^{t + 1} = w_{x y}^{t} + {delw}_{x y}^{t} . \end{array}$

However, the previous weight update is reverted if the minimum was missed because the previous step was too large. Mathematically: $\begin{matrix} (15) & {delw}_{x y}^{t} = - {delw}_{x y}^{t - 1} \quad \text{if } \frac{\partial E^{(t - 1)}}{\partial w_{x y}} \times \frac{\partial E^{(t)}}{\partial w_{x y}} < 0 . \end{matrix}$

Because of this backtracking weight step, the derivative is expected to change its sign once more in the following step. To avoid adapting the update value a second time, there should be no adaptation of the update value in the succeeding step; this is achieved by setting $\frac{\partial E^{(t - 1)}}{\partial w_{x y}} = 0$ in the ${del}_{x y}$ adaptation rule above.
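The update rules of Eqs. (12) to (15) can be combined into a single vectorized step, sketched below. Several choices are assumptions of this sketch rather than details given in the text: the function name `rprop_step`, the factor values $ρ^{+} = 1.2$ and $ρ^{-} = 0.5$ (commonly used defaults for RPROP), and the use of the sign of the current gradient for the weight step, as in the standard RPROP formulation. Bounds on the update value are omitted for brevity.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta, prev_dw,
               rho_plus=1.2, rho_minus=0.5):
    """One RPROP iteration with weight backtracking.

    w, grad, prev_grad, delta, prev_dw are arrays of the same shape.
    Returns (new_w, grad_to_store, new_delta, dw).
    """
    sign_change = prev_grad * grad
    # Eq. (12): grow the update value when the sign is kept, shrink it on a flip.
    delta = np.where(sign_change > 0, delta * rho_plus,
                     np.where(sign_change < 0, delta * rho_minus, delta))
    # Eq. (13): step against the sign of the gradient (assumed: current gradient).
    dw = -np.sign(grad) * delta
    # Eq. (15): where the sign flipped, revert the previous weight step instead.
    flipped = sign_change < 0
    dw = np.where(flipped, -prev_dw, dw)
    # Store a zero gradient for flipped weights so that the update value is
    # not adapted again in the succeeding step.
    grad_to_store = np.where(flipped, 0.0, grad)
    # Eq. (14): apply the weight step.
    return w + dw, grad_to_store, delta, dw
```

In a training loop, the returned `grad_to_store`, `delta` and `dw` would be passed back in as `prev_grad`, `delta` and `prev_dw` on the next epoch.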

5. Experimental results and discussion

The proposed unsupervised CNN is assessed on the face and skin detection dataset [48]. A broad set of experiments is performed in order to assess face classification performance on this dataset and to compare it with other classification methods. Before turning to the analysis and evaluation, we first explain how the dataset was prepared.

5.1. Dataset

The dataset used in our experiments is a complex and challenging one taken from the face and skin detection dataset [48]. It contains 4000 color images that vary in background scene, illumination conditions, and face and skin types. The illumination conditions comprise indoor and outdoor lighting: of the 4000 images, 1931 were taken in indoor lighting, 1855 in outdoor lighting, and the remaining 214 in other illumination conditions. The skin types include yellowish, whitish, darkish and brownish skin and are distributed across the dataset as follows: 1665 images of whitish and pink skin, 1402 images of yellowish and light brownish skin, 965 images of reddish, darkish and dark brownish skin, and 102 images of other skin types.

We build a training dataset of 2000 images by taking one thousand face and one thousand non-face images from [48], as shown in Fig. 4, and fixing the size of each image to $32 \times 32$ pixels. To provide the ground truth, all images are precisely segmented for face regions.

Fig. 4.

Face and non-face outlines of training sample.

5.2. Results and discussion

In this sub-section, we describe the performance evaluation of the proposed CNN architecture on the face and skin detection dataset. We evaluate two main components of the CNN, i.e. depth and filter learning.

To investigate how network depth and filter learning contribute to performance, we compare five different CNN architectures. The aim of implementing these architectures is to assess the proposed architecture against state-of-the-art algorithms. The four comparison architectures and the proposed architecture are listed in Table 2 and described below.

Table 2
Training algorithm RPROP with different weight initialization parameters

Training algorithm	Weight initialization	Total CNN layers
[47]	Standard normal distribution	6
RPROP-SUD	Standard uniform distribution	6
RPROP-sparse filtering	Sparse filtering	6
RPROP-RN	Standard normal distribution	4
Proposed model	CSFL	6

Table 3

Mean square error (MSE) obtained at different training epochs by using several filter learning methods on face dataset [48]

Filter learning method	MSE at 100 Epoch	MSE at 200 Epoch	MSE at 400 Epoch	MSE at 600 Epoch	MSE at 900 Epoch	MSE at 1100 Epoch
Randomly for CNN-RN-6 [47]	0.2521005	0.2044492	0.13242635	0.1045397	0.072238498	0.068234796
Randomly using standard uniform distribution for CNN-SUD-6	0.59115448	0.3871029	0.18474131	0.1453697	0.098927739	0.084553017
Sparse filtering for CNN-sparse filtering [52]	0.1937667	0.123994	0.0903676	0.0763471	0.04336342	0.02873581
Randomly using standard normal distribution for CNN-RN-4	0.92997785	0.7197634	0.33171603	0.267941	0.16998999	0.16223717
Proposed model	0.1212366	0.03994004	0.02369597	0.02008119	0.02	0.0173318

Table 4

Comparison results at different training epochs of various methods on the dataset in [48]

Filter learning method	MSE at 100 Epoch	MSE at 200 Epoch	MSE at 400 Epoch	MSE at 600 Epoch	MSE at 900 Epoch	MSE at 1100 Epoch
CNN-RN-6 [47]	0.2521005	0.2044492	0.13242635	0.1045397	0.072238498	0.068234796
CNN-SUD-6	0.59115448	0.3871029	0.18474131	0.1453697	0.098927739	0.084553017
CNN-sparse filtering [52]	0.1937667	0.123994	0.0903676	0.0763471	0.04336342	0.02873581
CNN-RN-4	0.92997785	0.7197634	0.33171603	0.267941	0.16998999	0.16223717
Proposed model	0.1212366	0.03994004	0.02369597	0.02008119	0.02	0.0173318

CNN-RN-4 is a standard 4-layer CNN architecture for face classification, in which the first convolution layer is initialized randomly from a standard normal distribution with zero mean and unit variance.

CNN-RN-6 [47] is a standard 6-layer CNN architecture for face classification, in which the first convolution layer is initialized randomly from a standard normal distribution with zero mean and unit variance.

CNN-SUD-6 is a standard 6-layer CNN architecture for face classification, in which the first convolution layer is initialized randomly from a standard uniform distribution.

CNN-sparse filtering [52] is based on a standard 6-layer CNN architecture, in which the first convolution layer uses 2-D filters learned through sparse filtering.

The proposed model is based on a standard 6-layer CNN architecture. CSFL performs the initial filter learning, and the learned filters are then used by the CNN for higher-level feature learning and classification.
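The two random baselines differ only in the distribution the first-layer filters are drawn from. The following sketch illustrates that difference; the helper `init_filters`, the filter count (100) and the filter size (5 x 5) are hypothetical choices for illustration and are not taken from the paper. Sparse filtering and CSFL require a learning stage and are not covered here.

```python
import random


def init_filters(n_filters, size, scheme, rng):
    """Return n_filters square filters (size x size) drawn from the given scheme."""
    if scheme == "normal":
        # Standard normal distribution: zero mean, unit variance.
        draw = lambda: rng.gauss(0.0, 1.0)
    elif scheme == "uniform":
        # Standard uniform distribution on [0, 1).
        draw = rng.random
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [[[draw() for _ in range(size)] for _ in range(size)]
            for _ in range(n_filters)]


rng = random.Random(0)  # fixed seed so the draw is reproducible
rn_filters = init_filters(100, 5, "normal", rng)    # style of CNN-RN-4 / CNN-RN-6
sud_filters = init_filters(100, 5, "uniform", rng)  # style of CNN-SUD-6
```

Note that standard uniform filters are all non-negative, whereas standard normal filters are centred on zero.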

All these CNN architectures are trained with the same training algorithm, i.e. RPROP. Table 2 summarizes the weight initialization methods compared under RPROP. Each network is trained for 2000 epochs. To measure the performance of each network, we use ten-fold cross-validation and compare the training MSE, the number of training epochs and the training time.
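A ten-fold split of the 2000 training images can be sketched as follows. The paper does not specify how the folds are formed, so the shuffling, the seed and the absence of class stratification are assumptions, and `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold: k disjoint, nearly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(2000, k=10))
```

Each of the ten splits holds out 200 images for validation and trains on the remaining 1800; every image is held out exactly once.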

Fig. 5.

Training MSE against CNN epochs. (a) [47]. (b) CNN-SUD-6. (c) CNN-RN-4. (d) [52]. (e) Proposed model.

Table 3 and Table 4 report the MSE obtained at different training epochs with the different filter learning methods. When the CNN is initialized with the proposed model, RPROP reaches a low MSE more quickly than with any other initialization. For example, to reach an MSE of 0.12, RPROP takes only 100 epochs with the proposed algorithm, whereas it needs 350 and 310 epochs with the approaches of [47] and [52] respectively. Figure 5 shows that the proposed CSFL outperforms all other filter learning models in terms of MSE. We therefore conclude that filters obtained through CSFL help the network escape local minima better than the other standard methods. Furthermore, Table 5 and Table 6 show the classification accuracy of all the models used in the experiments. CNN-RN-6 and CNN-SUD-6 have similar classification rates: 98.30% and 97.90% respectively on the training dataset, and 94.83% and 93.16% on the testing dataset (differences of 0.40 and 1.67 percentage points). The CNN-RN-4 network achieves the lowest classification rates (95.35% and 91.66%). The highest classification rates are achieved by the proposed method: 99.52% on the training dataset and 97.78% on the testing dataset. This ranking is consistent with the training speed comparison discussed earlier.
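The convergence comparison can be checked against the tabulated values. The sketch below copies the MSE values from Table 3 and reports, for each method, the first tabulated epoch at which the MSE is at or below a target. The helper name `first_checkpoint_at_or_below` is hypothetical, and because only six checkpoints are tabulated, the result is an upper bound on the crossing epoch rather than the exact epoch that would be read from the full curves in Fig. 5.

```python
EPOCHS = [100, 200, 400, 600, 900, 1100]

# MSE values copied from Table 3 / Table 4.
MSE = {
    "CNN-RN-6 [47]": [0.2521005, 0.2044492, 0.13242635, 0.1045397, 0.072238498, 0.068234796],
    "CNN-SUD-6": [0.59115448, 0.3871029, 0.18474131, 0.1453697, 0.098927739, 0.084553017],
    "CNN-sparse filtering [52]": [0.1937667, 0.123994, 0.0903676, 0.0763471, 0.04336342, 0.02873581],
    "CNN-RN-4": [0.92997785, 0.7197634, 0.33171603, 0.267941, 0.16998999, 0.16223717],
    "Proposed model": [0.1212366, 0.03994004, 0.02369597, 0.02008119, 0.02, 0.0173318],
}


def first_checkpoint_at_or_below(curve, target):
    """Return the first tabulated epoch whose MSE is <= target, or None."""
    for epoch, mse in zip(EPOCHS, curve):
        if mse <= target:
            return epoch
    return None


# Target: the MSE the proposed model has already reached after 100 epochs.
target = MSE["Proposed model"][0]
reached = {name: first_checkpoint_at_or_below(curve, target)
           for name, curve in MSE.items()}
for name, epoch in reached.items():
    print(f"{name}: {epoch}")
```

With this target, the tabulated checkpoints give 100 epochs for the proposed model, 400 for CNN-sparse filtering, 600 for CNN-RN-6 and 900 for CNN-SUD-6, while CNN-RN-4 does not reach the target within the tabulated 1100 epochs.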

Table 5

Classification accuracy versus different filter learning methods on face and skin dataset [48]

Filter learning method	Accuracy on training dataset	Accuracy on test dataset
Randomly for CNN-RN-6 [47]	98.30%	94.83%
Randomly using standard uniform distribution for CNN-SUD-6	97.90%	93.16%
Sparse filtering [40] for CNN-sparse filtering [52]	99.20%	96.32%
Randomly using standard normal distribution for CNN-RN-4	95.35%	91.66%
Proposed model	99.52%	97.78%

Table 6

Classification accuracy of different stochastic methods on the dataset in [48]

Methods	Accuracy on training dataset	Accuracy on test dataset
CNN-RN-6 [47]	98.30%	94.83%
CNN-SUD-6	97.90%	93.16%
CNN-sparse filtering [52]	99.20%	96.32%
CNN-RN-4	95.35%	91.66%
Proposed model	99.52%	97.78%
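As a small worked example on the accuracies above, the difference between training and test accuracy can be computed per method. The values are copied from Table 6; the dictionary and variable names are hypothetical.

```python
# (training accuracy, test accuracy) in percent, copied from Table 6.
ACCURACY = {
    "CNN-RN-6 [47]": (98.30, 94.83),
    "CNN-SUD-6": (97.90, 93.16),
    "CNN-sparse filtering [52]": (99.20, 96.32),
    "CNN-RN-4": (95.35, 91.66),
    "Proposed model": (99.52, 97.78),
}

# Train-test difference in percentage points, rounded to two decimals.
gaps = {name: round(train - test, 2) for name, (train, test) in ACCURACY.items()}
smallest = min(gaps, key=gaps.get)
```

The proposed model has the smallest difference (1.74 percentage points) and CNN-SUD-6 the largest (4.74).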

Fig. 6.

Comparison on the standard face and skin detection dataset of training algorithm RPROP with different weight initialization. (a) Training MSE versus the number of training epochs, and (b) training MSE versus the training time.

When the CNN is initialized with a non-optimal filter learning approach, such as random initialization from a standard normal or standard uniform distribution, more training time is required to find a good solution, as shown in Fig. 6. Figure 6(a) shows that at any given epoch, RPROP achieves the smallest MSE with the proposed model, followed by the RPROP-sparse filtering model. In contrast, RPROP with weights initialized randomly from a standard normal distribution yields a larger MSE, followed by the standard uniform distribution. In terms of training time, Fig. 6(b) shows that RPROP is slower with weights initialized randomly from a standard normal distribution than with all other initializations, whereas it converges faster with the proposed learning algorithm and the sparse filtering method than with all other methods.

It is clear from the experimental results that the proposed model outperforms the other state-of-the-art methods in classification. The highest accuracy, obtained by the proposed architecture, demonstrates the ability of CSFL to capture effective and distinguishable features from unlabeled training images for the face classification task. Training plays a significant role in the development and efficiency of a CNN. Among the learning algorithms proposed for CNN, the contributions of the proposed learning algorithm (CSFL) are as follows:

It allows the CNN to converge faster and to require less memory in training compared to the other learning algorithms.

It allows the training algorithm of the CNN to optimize its weights more swiftly and precisely to find a good solution.

6. Conclusion

A novel architecture for face image classification has been presented, in which CSFL is introduced to learn the filter bank in order to capture effective and distinguishable features. The major contributions of the proposed method are: (1) it provides a good initialization of the CNN weights, which helps prevent the system from getting stuck in a local minimum; (2) it speeds up the CNN by providing robust initial filter learning; (3) it lessens the dependency of the CNN on large amounts of training data by treating the data in an unsupervised manner. We expect this novel model to serve as a baseline for future research, and we are confident that these results will accelerate further research on unsupervised CNNs.

In future, we would like to extend our system by incorporating multi-resolution and colour information to further improve visual classification. We also expect that a more robust classifier with multitask learning could help the system achieve better efficiency.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (No. U1405254, U1536115, U1536207, 61671030, 61271392), the Excellent Talents Foundation of Beijing, and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT & TCD201404052). The authors wish to thank anonymous reviewers for their valuable comments and suggestions that improved this paper.

References

1. K.J. Adebayo, O.W. Onifade and F.I. Yisa, Comparative analysis of PCA-based and neural network based face recognition systems, in: 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA), IEEE, 2012, pp. 28–33.

2. Ahmed, Yu, Xu, Gong and Xing, Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks, in: Computer Vision – ECCV 2008, Springer, 2008, pp. 69–82.

3. Baccouche, Mamalet, Wolf, Garcia and Baskurt, Sparse shift-invariant representation of local 2D patterns and sequence learning for human action recognition, in: 2012 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012, pp. 3823–3826.

4. Belghini, Zarghili, Kharroubi and Majda, Color facial authentication system based on neural network, in: 2011 Colloquium in Information Science and Technology (CIST), IEEE, 2011, p. 8.

5. Belkin, Niyogi and Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006), 2399–2434.

6. Bengio, Lamblin, Popovici and Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems, 2007, pp. 153–160.

7. A.C. Berg, T.L. Berg and Malik, Shape matching and object recognition using low distortion correspondences, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, Vol. 1, IEEE, 2005, pp. 26–33.

8. F.G. Bulnes, Classification of periodical defects in inspection systems based on computer vision, AI Communications 25(4) (2012), 385–386.

9. Cao, Wei, Han and Lin, Robust face clustering via tensor decomposition, IEEE Transactions on Cybernetics 45(11) (2015), 2546–2557. doi:10.1109/TCYB.2014.2376938.

10. Dalal and Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, Vol. 1, IEEE, 2005, pp. 886–893.

11. A.J. Dhanaseely, Himavathi and Srinivasan, Principal component analysis based cascade neural network for face recognition, in: 2012 International Conference on Emerging Trends in Science, Engineering and Technology (INCOSET), IEEE, 2012, pp. 255–259.

12. M.J. Er, Chen and Wu, High-speed face recognition based on discrete cosine transform and RBF neural networks, IEEE Transactions on Neural Networks 16(3) (2005), 679–691. doi:10.1109/TNN.2005.844909.

13. Erhan, Bengio, Courville, P.-A. Manzagol, Vincent and Bengio, Why does unsupervised pre-training help deep learning, Journal of Machine Learning Research 11 (2010), 625–660.

14. Fatahi, Zadkhosh and Chalechale, Face recognition with linear discriminant analysis and neural networks, in: 2013 First Iranian Conference on Pattern Recognition and Image Analysis (PRIA), IEEE, 2013, pp. 1–4.

15. Feraund, O.J. Bernier, J.-E. Viallet and Collobert, A fast and accurate face detector based on neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(1) (2001), 42–53. doi:10.1109/34.899945.

16. Fukushima and Miyake, Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position, Pattern Recognition 15(6) (1982), 455–469. doi:10.1016/0031-3203(82)90024-3.

17. Fukushima and Wake, Handwritten alphanumeric character recognition by the neocognitron, IEEE Transactions on Neural Networks 2(3) (1991), 355–365. doi:10.1109/72.97912.

18. Gan, Liu, Dong and Zhong, A PCA-based convolutional network, arXiv:1505.03703, 2015.

19. M.D. Ganis, C.L. Wilson and J.L. Blue, Neural network-based systems for handprint OCR applications, IEEE Transactions on Image Processing 7(8) (1998), 1097–1112. doi:10.1109/83.704304.

20. Hajinoroozi, Mao and Huang, Prediction of driver’s drowsy and alert states from EEG signals with deep learning, in: 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), IEEE, 2015, pp. 493–496.

21. Hatipoglu and Bilgin, Classification of histopathological images using convolutional neural network, in: 2014 4th International Conference on Image Processing Theory, Tools and Applications (IPTA), IEEE, 2014, pp. 1–6.

22. G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507. doi:10.1126/science.1127647.

23. F.J. Huang, Boureau, LeCun et al., Unsupervised learning of invariant feature hierarchies with applications to object recognition, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, CVPR’07, IEEE, 2007, pp. 1–8.

24. F.J. Huang and LeCun, Large-scale learning with SVM and convolutional nets for generic object recognition, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.

25. Jarrett, Kavukcuoglu, LeCun et al., What is the best multi-stage architecture for object recognition, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2146–2153.

26. Kavukcuoglu, Fergus, LeCun et al., Learning invariant features through topographic filter maps, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, IEEE, 2009, pp. 1605–1612.

27. Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu and Y.L. Cun, Learning convolutional feature hierarchies for visual recognition, in: Advances in Neural Information Processing Systems, 2010, pp. 1090–1098.

28. Lawrence, C.L. Giles, A.C. Tsoi and A.D. Back, Face recognition: A convolutional neural-network approach, IEEE Transactions on Neural Networks 8(1) (1997), 98–113. doi:10.1109/72.554195.

29. Lazebnik, Schmid and Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, IEEE, 2006, pp. 2169–2178.

30. LeCun, B.E. Boser, J.S. Denker, Henderson, R.E. Howard, W.E. Hubbard and L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, 1990, pp. 396–404.

31. LeCun, Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324. doi:10.1109/5.726791.

32. Lee, Grosse, Ranganath and A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 609–616.

33. T.W. Lee, Independent Component Analysis, Springer, 1998, pp. 27–66.

34. Li, Yang, Chen and Zhu, A pre-training strategy for convolutional neural network applied to Chinese digital gesture recognition, in: 2016 8th IEEE International Conference on Communication Software and Networks (ICCSN), IEEE, 2016, pp. 620–624.

35. G.L. Libralon and R.A.F. Romero, Mapping of facial elements for emotion analysis, in: 2014 Brazilian Conference on Intelligent Systems (BRACIS), IEEE, 2014, pp. 222–227.

36. D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60(2) (2004), 91–110. doi:10.1023/B:VISI.0000029664.99615.94.

37. Mutch and D.G. Lowe, Multiclass object recognition with sparse, localized features, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, IEEE, 2006, pp. 11–18.

38. Nair and G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.

39. Ng, Sparse autoencoder, CS294A Lecture Notes 72(2011) (2011), 1–19.

40. Ngiam, Chen, S.A. Bhaskar, P.W. Koh and A.Y. Ng, Sparse filtering, in: Advances in Neural Information Processing Systems, 2011, pp. 1125–1133.

41. Nocedal, Updating quasi-Newton matrices with limited storage, Mathematics of Computation 35(151) (1980), 773–782. doi:10.1090/S0025-5718-1980-0572855-7.

42. B.A. Olshausen and D.J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1, Vision Research 37(23) (1997), 3311–3325. doi:10.1016/S0042-6989(97)00169-7.

43. Onis, Garcia, Sanson and J.-L. Dugelay, Object detection with a minimal set of examples using convolutional PCA, in: 2009 IEEE International Workshop on Multimedia Signal Processing, MMSP’09, IEEE, 2009, pp. 1–4.

44. Osuna, Freund and Girosi, Training support vector machines: An application to face detection, in: 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, IEEE, 1997, pp. 130–136.

45. Pan, Lei, Zhang, Sun and Kwong, Fast motion estimation based on content property for low-complexity H.265/HEVC encoder, IEEE Transactions on Broadcasting 62(3) (2016), 675–684. doi:10.1109/TBC.2016.2580920.

46. Pan, Zhang and Kwong, Efficient motion and disparity estimation optimization for low complexity multiview video coding, IEEE Transactions on Broadcasting 61(2) (2015), 166–176. doi:10.1109/TBC.2015.2419824.

47. S.L. Phung and Bouzerdoum, A pyramidal neural network for visual pattern recognition, IEEE Transactions on Neural Networks 18(2) (2007), 329–343. doi:10.1109/TNN.2006.884677.

48. S.L. Phung, Bouzerdoum and Chai, Skin segmentation using color pixel classification: Analysis and comparison, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(1) (2005), 148–154. doi:10.1109/TPAMI.2005.17.

49. Pinto, D.D. Cox and J.J. DiCarlo, Why is real-world visual object recognition hard?, PLoS Computational Biology 4(1) (2008), e27.

50. Pulina and Tacchella, Challenging SMT solvers to verify neural networks, AI Communications 25(2) (2012), 117–135.

51. B.A. Rajoub and Zwiggelaar, Thermal facial analysis for deception detection, IEEE Transactions on Information Forensics and Security 9(6) (2014), 1015–1023. doi:10.1109/TIFS.2014.2317309.

52. S.U. Rehman, Tu, Huang and Yang, Face recognition: A novel un-supervised convolutional neural network method, in: IEEE International Conference of Online Analysis and Computing Science (ICOACS), IEEE, 2016, pp. 139–144.

53. Riedmiller and Braun, A direct adaptive method for faster backpropagation learning: The RPROP algorithm, in: 1993 IEEE International Conference on Neural Networks, IEEE, 1993, pp. 586–591.

54. Rosenblum, Yacoob and L.S. Davis, Human expression recognition from motion using a radial basis function network architecture, IEEE Transactions on Neural Networks 7(5) (1996), 1121–1138. doi:10.1109/72.536309.

55. H.A. Rowley, Baluja and Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1) (1998), 23–38. doi:10.1109/34.655647.

56. Salakhutdinov, Mnih and Hinton, Restricted Boltzmann machines for collaborative filtering, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 791–798.

57. Sebag, A tour of machine learning: An AI perspective, AI Communications 27(1) (2014), 11–23.

58. Serre, Wolf and Poggio, Object recognition with features inspired by visual cortex, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, Vol. 2, IEEE, 2005, pp. 994–1000.

59. Somanath, M.V. Rohith and Kambhamettu, Vadana: A dense dataset for facial image analysis, in: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011, pp. 2175–2182.

60. Sousa and J.S. Cardoso, The data replication method for the classification with reject option, AI Communications 26(3) (2013), 281–302.

61. K.-K. Sung and Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1) (1998), 39–51. doi:10.1109/34.655648.

62. A.J. Tallón-Ballesteros, Contemporary training methodologies based on evolutionary artificial neural networks with product and sigmoid neurons for classification, AI Communications 29(3) (2016), 469–471. doi:10.3233/AIC-150681.

63. J.M. Valls, I.M. Galván and Isasi, LRBNN: A lazy radial basis neural network model, AI Communications 20(2) (2007), 71–86.

64. Wen, Shao, Xue and Fang, A rapid learning algorithm for vehicle classification, Information Sciences 295 (2015), 395–406. doi:10.1016/j.ins.2014.10.040.

65. Xu, Lu, Gao, Wang and Yan, Facial analysis with a Lie group kernel, IEEE Transactions on Circuits and Systems for Video Technology 25(7) (2015), 1140–1150. doi:10.1109/TCSVT.2014.2365655.

66. Yamaguchi and Itakura, A car detection system using the neocognitron, in: IEEE International Joint Conference on Neural Networks, Vol. 2, 1991, pp. 1208–1213.

67. M.-H. Yang, Roth and Ahuja, A SNoW-based face detector, in: Advances in Neural Information Processing Systems, 2000, pp. 862–868.

68. M.H. Yap, Ugail and Zwiggelaar, A database for facial behavioural analysis, in: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), IEEE, 2013, pp. 1–6.

69. M.H. Yap, Ugail, Zwiggelaar and Rajoub, Facial image processing for facial analysis, in: 2010 IEEE International Carnahan Conference on Security Technology (ICCST), IEEE, 2010, pp. 198–204.

70. Yuan, Sun and Lv, Fingerprint liveness detection based on multi-scale LPQ and PCA, China Communications 13(7) (2016), 60–65. doi:10.1109/CC.2016.7559076.

71. Zhao, T.-K. Kim and Luo, Unified face analysis by iterative multi-output random forests, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1765–1772.