Abstract
Deep learning combined with fuzzy logic is highly modular and accurate. An adaptive Fuzzy Anisotropy Diffusion Filter (FADF) is used to remove noise from the image while preserving edges and lines and improving the smoothing effect. Edge and noise information is detected through pre-edge detection using fuzzy contrast enhancement, post-edge detection using a fuzzy morphological gradient filter, and a noise detection technique. A Convolution Neural Network (CNN) with the ResNet-164 architecture is used for automatic feature extraction. The resulting feature vectors are classified using ANFIS deep learning. The top-1 error rate is reduced from 21.43% to 18.8%, and the top-5 error rate is reduced to 2.68%. The proposed work achieves a high accuracy rate at low computation cost. A recognition rate of 99.18% and an accuracy of 98.24% are achieved on standard datasets. Experimental results are better than those of existing techniques on the FACES 94, FERET, Yale-B, CMU-PIE and JAFFE datasets and on other state-of-the-art datasets.
Introduction
Research in computer vision, multimedia and pattern recognition focuses on face recognition and fingerprint recognition to reduce malpractice and illegal activities through surveillance cameras and computer-interfaced security systems. Identification and authentication have been widely implemented in security systems. Face recognition is commonly used since it identifies a person precisely. Still, occlusion, pose, expression variation, illumination variation, and controlled versus uncontrolled capture conditions pose challenges [1].
Image processing extracts meaningful information from an image by preprocessing it, extracting features and classifying them. Preprocessing enhances the contrast of the image, eliminates noise through filters and preserves edge and line information. A number of noise removal methods are available to deal with Gaussian noise, salt-and-pepper noise and Poisson noise; the Wiener filter, the median filter and linear filters are commonly used.
A convolution neural network (CNN) overcomes a drawback of the conventional artificial neural network (ANN), which is computationally expensive. A CNN [2] extracts features from the image automatically while reducing the number of parameters required to set up the model. It allows image-specific properties to be encoded into the architecture, making it suitable for image-focused tasks. Common layers such as convolution layers, pooling layers and hidden layers are stacked to form a CNN. A network in which multiple hidden layers are stacked on one another is referred to as deep learning. Existing CNN architectures include AlexNet [3], Vgg-vd16 [4], Vgg-vd19 [5] and ResNet-101 [6]. One of these architectures is selected based on system performance and error rate. In this work the ResNet-164 architecture is used to evaluate the accuracy of the system.
For face recognition, feature extraction and dimensionality reduction have been attempted using LBP [7], RLBP [9], texture features (GLCM) [13], local directional ternary pattern [14], sparse manifold subspace learning [8], PCA [10], LDA [11] and ADA [12]. Classification has been performed using machine learning methods such as K-NN [15], SVM [16] and neural networks [17] to identify the correct individual among the test images. Deep learning extends the neural network concept from machine learning.
The proposed work removes noise using fuzzy anisotropy diffusion, which preserves edges while despeckling the image. The preprocessed image is given as input to the convolution neural network with the ResNet-164 architecture to extract a feature vector. Finally, fuzzy min-max hyperbox deep learning is used for classification. Related works are presented next.
Related works
HyperFace, a deep convolution neural network, addresses four challenging tasks: face detection, landmark localization, pose estimation and gender recognition [1]. Two variants, HyperFace-ResNet and Fast-HyperFace, are proposed to achieve state-of-the-art performance and, by means of a high-recall fast face detector, to increase the speed of the algorithm. The network captures both local and global information, and the experimental results are significantly better than those of competing algorithms on the AFLW and Pascal face datasets.
A CNN consists of convolutional layers, pooling layers and fully connected layers [18], and is a type of deep learning model that can be trained quickly. A weighted mixture deep neural network automatically extracts features for facial expression recognition [19]. The authors used the CK+ [35], JAFFE [37] and Oulu-CASIA [36] datasets and obtained recognition accuracies of 0.970, 0.922 and 0.923, respectively.
The CNN architecture in [3] has five convolution layers, some of which are followed by max pooling layers, and three fully connected layers followed by a softmax. Non-saturating neurons and a GPU implementation are used to improve the performance of the system. Dropout is applied in the fully connected layers to reduce overfitting. The ImageNet LSVRC dataset is used in the implementation. A top-5 error rate of 15.3% was achieved, compared to 26.2% for the second-best entry.
The authors of [4] address two questions: first, how a large-scale dataset can be assembled by a combination of automation and a human in the loop; second, the complexity of deep neural networks. They also discuss data purity, performance and time complexity. The standard LFW and YTF face benchmark datasets are used in the implementation. They achieved an accuracy of 98.95% on LFW and 97.3% on YTF.
A very deep neural network architecture using very small 3×3 convolution filters was proposed to reduce the top-1 and top-5 error rates [5]. The ConvNet architecture achieved a top-1 error rate of 25.5% and a top-5 error rate of 8.0%. The ImageNet dataset is used in the implementation, and the results are better than the existing state-of-the-art results.
The deep residual learning (ResNet) architecture makes it possible to train deeper convolution neural networks than the existing ones [6]. On the ImageNet dataset, networks with up to 152 layers are used, which is 8× deeper than VGG nets; they use 3×3 convolutions with up to 512 filters, followed by average pooling and a 1000-way fully connected layer. An error rate of 3.57% was obtained, which is the best among the compared results.
A fuzzy deep neural network with sparse autoencoder (FDNNSA) was proposed to understand human intention based on emotions and on information such as age, gender and region; fuzzy C-means (FCM) is used to cluster the input data and FDNNSA is used to detect the intention [38]. In [39], a fuzzy restricted Boltzmann machine (F3RBM) is developed in which the fuzzy technique improves feature extraction by removing redundancy; the extracted features are passed to an SVM, which achieves fast and high-precision automatic classification of dissimilar samples.
The proposed system enhances contrast and preserves edges, thereby improving image quality. The performance and accuracy of the system are increased by deep learning. The next section gives the block diagram of the proposed system.
Proposed work
The proposed system has three stages. Preprocessing is done through FADF to improve the quality of the image. Features are extracted using the CNN ResNet-164 architecture, and a deep fuzzy classifier is used to train on and classify the images, providing better accuracy than the existing approaches. Figure 1 illustrates the overall block diagram of the proposed system.

Figure 1. Block diagram of the proposed system.
FADF removes noise from the input image without removing significant parts of the image content. Edges, lines and other details important for understanding the image are preserved. FADF is an extension of the anisotropy diffusion filter (ADF), also called the Perona-Malik (PM) model, introduced by Perona and Malik in 1990 [34]; it is a non-linear diffusion filter used in spatial regularization approaches. In ADF, edges are preserved based on a single constant gradient parameter: if the gradient at a pixel is greater than this parameter, the edge is preserved. When edge and noise overlap, however, this condition does not work efficiently for the given image. Hence FADF is proposed to preserve edges while despeckling the noise.
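To make the role of the single gradient parameter concrete, the following is a minimal sketch of one explicit Perona-Malik diffusion step in plain Python. It illustrates the classical ADF only, not the proposed FADF; the exponential conduction function, the parameter names `kappa` and `dt`, and their default values are assumptions chosen for illustration.

```python
import math

def perona_malik_step(img, kappa=15.0, dt=0.2):
    """One explicit Perona-Malik diffusion step on a 2-D list of floats.

    kappa is the constant gradient parameter (assumed value);
    dt is the step size (kept at or below 0.25 for stability).
    """
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(rows):
        for c in range(cols):
            flux = 0.0
            # Four nearest neighbours; border pixels simply have fewer.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    grad = img[rr][cc] - img[r][c]
                    # Conduction is close to 1 for small gradients (smoothing)
                    # and close to 0 for large gradients (edge preserved).
                    g = math.exp(-(grad / kappa) ** 2)
                    flux += g * grad
            out[r][c] = img[r][c] + dt * flux
    return out
```

With a step edge of height 100 and a small bump of height 5 in the flat region, the bump is smoothed while the edge is left almost untouched. When noise produces gradients comparable to those of edges, a single `kappa` cannot separate the two, which is the limitation FADF is meant to address.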
FADF is performed in three steps: image preprocessing, the fuzzy inference system and diffusion iteration. Image preprocessing is divided into edge detection and noise removal. Edge detection is a two-step process: pre-edge detection is done through a fuzzy contrast enhancement technique, and post-edge detection is done by a morphological gradient operation to improve the performance of edge detection. Noise detection is done using an adaptive median filter. The membership values obtained from edge detection and noise removal are used in the fuzzy inference system.
Fuzzy preprocessing
A. Pre-edge detection using fuzzy contrast equalization
Fuzzy contrast equalization improves the brightness of the input image based on the intensity levels low, medium and high, which act as membership functions; A, B and C are the fuzzy rules applied to improve the contrast of the input image. First the fuzzy limits are set based on the membership functions, then the fuzzy rules are applied, and finally defuzzification converts the linguistic data into the required crisp data. Figure 2 shows the contrast-enhanced image obtained after applying the fuzzy rules.

Figure 2. Contrast enhancement using fuzzy inference system.
The fuzzy rules for the membership functions are as follows,
Next, the resulting output image is given as input to edge detection using fuzzy logic.
In Equation (1), M×N is the number of pixels in the image, g is the gray level based on the brightness, g_mn is the intensity of the (m,n)th pixel and μ_mn is the membership value used to enhance the brightness of the image.
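Neither the rule table nor Equation (1) is reproduced in this text. As an illustration of the fuzzification, rule evaluation and defuzzification steps described in this subsection, the sketch below assumes three piecewise-linear membership functions (dark, gray, bright) over the range 0 to 255 and three rules of the form "IF dark THEN darker, IF gray THEN gray, IF bright THEN brighter", with a weighted-average defuzzifier. The breakpoints and output levels are hypothetical and are not taken from the paper.

```python
def memberships(g):
    """Return (dark, gray, bright) membership degrees for a gray level in 0..255.

    The breakpoints (127 and a half-width of 64) are assumed values.
    """
    dark = max(0.0, min(1.0, (127.0 - g) / 127.0))
    gray = max(0.0, 1.0 - abs(g - 127.0) / 64.0)
    bright = max(0.0, min(1.0, (g - 127.0) / 128.0))
    return dark, gray, bright

def enhance_pixel(g):
    """Apply the three assumed rules and defuzzify by weighted average."""
    dark, gray, bright = memberships(g)
    # Rule outputs (assumed): dark -> 0, gray -> 127, bright -> 255.
    numerator = dark * 0.0 + gray * 127.0 + bright * 255.0
    return numerator / (dark + gray + bright)

def enhance_image(img):
    """Enhance every pixel of a 2-D list of gray levels."""
    return [[enhance_pixel(g) for g in row] for row in img]
```

Under these assumptions, gray levels below the midpoint are pushed darker and levels above it are pushed brighter, which stretches the contrast while leaving 0, 127 and 255 unchanged.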
B. Edge detection using fuzzy logic

Figure 3. Edge detection using fuzzy rules.

Figure 4. Result obtained using morphological operation.
In Equation (2), F_Edge gives the final classification of a pixel as an edge pixel, where α_i are the fuzzy sets associated with the antecedent part of the fuzzy rule base, ȳ^l is the output class center and M is the number of fuzzy rules considered.
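Equation (2) is not reproduced in this text. One common form that is consistent with the symbols described above is the center-average defuzzifier, in which the output is the average of the class centers weighted by the firing strength of each rule. The sketch below assumes that form; the function and argument names are illustrative only.

```python
def center_average(firing_strengths, class_centers):
    """Center-average defuzzification over M fuzzy rules.

    firing_strengths: degree to which each rule's antecedent is satisfied.
    class_centers: output class center associated with each rule.
    """
    total = sum(firing_strengths)
    if total == 0:
        return 0.0  # no rule fired; treat the pixel as non-edge (assumption)
    weighted = sum(a * y for a, y in zip(firing_strengths, class_centers))
    return weighted / total
```

For example, with two rules whose class centers are 0 (non-edge) and 1 (edge), firing strengths of 0.2 and 0.8 give an edge score of 0.8.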
C. Post edge detection using Fuzzy morphological gradient
The fuzzy morphological gradient is used to smooth the detected edges. The morphological closing operation, a dilation followed by an erosion, smooths the edges over several iterations, and the number of iterations depends on the dilation depth. Morphological smoothing is done by an opening operation followed by a closing operation, which removes the dark and bright noise artifacts. Here dark and bright are the membership functions. Figure 4 illustrates the image obtained after performing the fuzzy morphological gradient operation. The fuzzy rules X and Y are as follows,
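The rules X and Y themselves are not reproduced in this text. As an illustration of the morphological operations named above, the following sketch applies them to an image of membership values in [0, 1], assuming a flat 3×3 structuring element so that dilation and erosion reduce to a local maximum and minimum. It uses the standard definitions (opening is erosion followed by dilation, closing is dilation followed by erosion) and is not the authors' implementation.

```python
def _window(img, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the border."""
    rows, cols = len(img), len(img[0])
    return [img[rr][cc]
            for rr in range(max(0, r - 1), min(rows, r + 2))
            for cc in range(max(0, c - 1), min(cols, c + 2))]

def dilate(img):
    return [[max(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def erode(img):
    return [[min(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def opening(img):
    # Removes small bright artifacts.
    return dilate(erode(img))

def closing(img):
    # Removes small dark artifacts.
    return erode(dilate(img))

def morphological_gradient(img):
    """Dilation minus erosion: large at edges, zero in flat regions."""
    d, e = dilate(img), erode(img)
    return [[d[r][c] - e[r][c] for c in range(len(img[0]))]
            for r in range(len(img))]
```

On a vertical step from 0 to 1, the gradient is 1 on the two columns adjacent to the step and 0 elsewhere.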
D. Noise Removal
Noise is removed using the fuzzy inference system, which computes a diffusion coefficient that is stronger in noisy regions than in smooth regions. The degree of noise is calculated as the standard deviation of each noisy pixel intensity from the local mean of its neighborhood.
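A minimal sketch of such a noise measure is given below. It assumes a 3×3 neighborhood and reads the description as the absolute deviation of a pixel from its local mean, divided by a hypothetical `scale` so that the result falls in [0, 1] and can serve as a membership value. The neighborhood size and the normalization are assumptions, not details given in the paper.

```python
def local_mean(img, r, c):
    """Mean of the 3x3 neighbourhood of (r, c), clipped at the border."""
    rows, cols = len(img), len(img[0])
    values = [img[rr][cc]
              for rr in range(max(0, r - 1), min(rows, r + 2))
              for cc in range(max(0, c - 1), min(cols, c + 2))]
    return sum(values) / len(values)

def noise_degree(img, r, c, scale=255.0):
    """Deviation of a pixel from its local mean, normalized to [0, 1]."""
    deviation = abs(img[r][c] - local_mean(img, r, c))
    return min(1.0, deviation / scale)
```

A pixel in a flat region gets degree 0, while an isolated impulse of 255 on a zero background gets a degree close to 0.89.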
The edge and noise information obtained is given as input to the fuzzy inference system (FIS), which maps it to the output diffusion coefficient by applying fuzzy rules. The logic is as follows: if edge information is present, no smoothing is needed; if no edge information is present, noise is assumed to exist and smoothing is required. Finally, defuzzification yields a single output value from the FIS. Edge and noise are membership functions on the interval from 0 to 1, and Fx and Fy are the fuzzy rules, as follows,
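The rules Fx and Fy are not reproduced in this text. The sketch below shows one way the stated logic could be expressed as a zero-order Sugeno rule base with minimum as the AND operator and a weighted-average defuzzifier. The three rules and their output levels (0, 1 and 0.3) are hypothetical choices made for illustration.

```python
def diffusion_coefficient(edge, noise):
    """Map edge and noise memberships (each in [0, 1]) to a coefficient in [0, 1]."""
    rules = [
        # R1: IF edge THEN no smoothing.
        (edge, 0.0),
        # R2: IF not edge AND noise THEN strong smoothing.
        (min(1.0 - edge, noise), 1.0),
        # R3: IF not edge AND not noise THEN mild smoothing (assumed level).
        (min(1.0 - edge, 1.0 - noise), 0.3),
    ]
    total = sum(weight for weight, _ in rules)
    # total is always positive: it equals 1 when edge == 0 and is at
    # least edge otherwise.
    return sum(weight * level for weight, level in rules) / total
```

A pure edge pixel yields a coefficient of 0, a noisy non-edge pixel yields 1, and intermediate memberships are blended smoothly between the rule outputs.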
Diffusion coefficient iteration
The defuzzified output is a fuzzy coefficient used to control the diffusion coefficient during the iterations. The gradient and the degrees of edge and noise control the speed of the iterative process. With too few iterations, information in the image is lost, whereas more iterations yield more information about the image.
Convolution neural network (CNN)
The edge and contrast features obtained are given as input to the CNN architecture, which consists of residual learning layers with rectified linear unit (ReLU) activations, convolution layers, pooling layers and a fully connected layer. Residual learning is an extension of the plain VGG-34 network [21]. The ResNet gains accuracy from a significantly increased depth, which is greater than that of the existing models.
ResNet-164 model architecture
When the number of layers is increased, accuracy degradation occurs, and this degradation is not caused by overfitting. The deep residual network addresses the degradation problem by using shortcut connections. In this approach, an identity mapping is used as the shortcut across the layers. Figure 5 illustrates the building block of the residual network. X is the identity mapping carried by the shortcut between two layers. It is easy to compute and gives a better error rate than the existing VGG-34 network.

Figure 5. Building block of Residual Learning.
The only difference between VGG-34 and the residual network is the addition of a shortcut between every two layers. The form of the shortcut depends on the dimensionality of its input and output. If the input and output dimensions are equal, the shortcut is an identity mapping. When the dimensionality increases, the shortcut can be realized in two ways. First, the identity operation is performed with zero padding for the added dimensions, which introduces no extra parameters. Second, a projection shortcut is used to match the dimensions. For both options a stride of 2 is used.
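The two shortcut options can be sketched on plain vectors as follows. This is a simplified illustration with fully connected weight matrices in place of convolutions, and the stride of 2 is not modelled; the helper names and the matrices `W1`, `W2` and `W_s` are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def identity_shortcut(x, out_dim):
    """Option 1: identity with zero padding; adds no parameters."""
    return x + [0.0] * (out_dim - len(x))

def projection_shortcut(x, W_s):
    """Option 2: linear projection W_s to the output dimension."""
    return matvec(W_s, x)

def residual_block(x, W1, W2, shortcut):
    """Two weight layers with a ReLU between them, plus the shortcut."""
    f = matvec(W2, relu(matvec(W1, x)))  # residual branch F(x)
    return relu([fi + si for fi, si in zip(f, shortcut)])
```

If the residual branch outputs zero, the block simply passes the shortcut through, which is why adding such blocks should not make a deeper network harder to fit than a shallower one.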
In Figure 5, F(X) + X is computed by a feed-forward neural network with a shortcut connection. In this work the shortcut performs identity mapping. The building block is defined by Equation (3).
Here x and y are the input and output vectors of the layer. The function F(x, {W_i}) represents the residual mapping to be learned. The dimensions of F and x must be equal in the residual network layer. When there is a mismatch in dimension, a linear projection W_s is applied on the shortcut, as in Equation (4), to match the dimensions of the input and output.
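Equations (3) and (4) are not reproduced in this text. Using the symbols defined above, and following the residual learning formulation of [6], they can be written as:

```latex
\begin{align}
y &= \mathcal{F}(x, \{W_i\}) + x \tag{3} \\
y &= \mathcal{F}(x, \{W_i\}) + W_s\, x \tag{4}
\end{align}
```

Equation (3) applies when the input and output have the same dimension, and Equation (4) applies when the projection W_s is needed to match them.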
Figure 7 illustrates the ResNet architecture for different input images, which consists of 19 parameter layers with a stride of 2. For each layer a shortcut operation is performed to match the dimensionality, which reduces degradation. Each shortcut spans two weight layers with one ReLU. The shortcut implementation adds no extra cost or time. When the number of layers is increased from ResNet-152 to ResNet-164, better accuracy is achieved than with the existing model. The error rate is stable from top-3 to top-5. Figure 6 illustrates the structure of the ResNet-164 architecture.

Figure 6. ResNet-164 Architecture.

Figure 7. ResNet architecture for different input images with 19 parameter layers and shortcut for each layer.
An adaptive neuro-fuzzy inference system (ANFIS) is a type of artificial neural network based on the Takagi–Sugeno fuzzy inference system. Since it combines neural network and fuzzy logic principles, it captures the advantages of both in a single framework. It consists of a set of fuzzy IF–THEN rules. Hence, ANFIS is used to classify the features obtained from the ResNet-164 architecture.
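The forward pass of a first-order Takagi-Sugeno system, which is the computation ANFIS arranges into network layers, can be sketched as follows for two inputs. Gaussian membership functions, the product as the AND operator and the rule format are assumptions made for illustration; the training procedure that tunes the parameters is not shown.

```python
import math

def gaussian(x, center, sigma):
    """Gaussian membership function."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def sugeno_forward(x1, x2, rules):
    """Evaluate a first-order Sugeno system.

    rules: list of ((c1, s1), (c2, s2), (p, q, r)), where (c, s) are the
    center and width of the membership function on each input and
    p*x1 + q*x2 + r is the linear consequent of the rule.
    """
    weights, outputs = [], []
    for (c1, s1), (c2, s2), (p, q, r) in rules:
        # Layers 1-2: fuzzify each input and combine into a firing strength.
        weights.append(gaussian(x1, c1, s1) * gaussian(x2, c2, s2))
        # Layer 4: linear consequent of the rule.
        outputs.append(p * x1 + q * x2 + r)
    # Layers 3 and 5: normalize the firing strengths and sum.
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, outputs)) / total
```

With two rules whose consequents are the constants 0 and 1, the output acts as a soft class score that moves from 0 toward 1 as the input moves from the first rule's center to the second's.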
The ANFIS deep classifier produces better results than existing machine learning algorithms such as SVM and K-NN. The next section discusses the results and compares them with existing works.
Experimentation and result
The results are compared based on the accuracy, recognition rate and error rate.
Results based on accuracy
The proposed DCNN ResNet-164 model provides a better accuracy rate than the existing approaches [23, 27]. The authors of [27] achieved accuracy rates of 90.58%, 90.02% and 90.58% on the Yale-B dataset with illuminated images, JAFFE with pose variation images and CMU-PIE with expression variation images, respectively. This work raises the accuracy to 95.62%, 97.51% and 97.23%. Biao Yang et al. [23] achieved accuracies of 94.98% and 90.86% on the CK+ and JAFFE datasets using CNN-VGG. Our approach achieved accuracies of 98.24% and 97.51% on CK+ and JAFFE. Table 1 also gives the accuracy obtained by the CNN [23] and gACNN [30] deep learning models. The accuracy of the proposed system is calculated using Equation (5).
Table 1. Accuracy Rate Obtained from Different Deep Learning Models
The top-1 error rate is the percentage of test images for which the target class is not the first prediction. The top-5 error rate is the percentage for which the target class is not among the first five predictions. Table 2 compares the top-1 and top-5 error rates. The ResNet model is tested and validated according to the error rate obtained. K. He et al. [6] achieved a 21.43% top-1 error rate and a 5.71% top-5 error rate. This approach reduced the top-1 error rate to 18.8% and the top-5 error rate to 3.03% for the training dataset. Increasing the number of layers along with the shortcut parameter reduced the top-5 error rate further to 2.68%. Equation (6) is used for calculating the error rate.
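Equation (6) is not reproduced in this text. Assuming the usual definition, in which a prediction counts as an error when the target class is not among the k highest-scoring classes, the top-k error rate can be computed as in the sketch below. The function name and the toy scores are illustrative and are not results from the paper.

```python
def top_k_error(scores, labels, k):
    """Percentage of samples whose true label is not among the top-k scores.

    scores: one list of class scores per sample.
    labels: the true class index of each sample.
    """
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return 100.0 * errors / len(labels)

# Toy example with three samples and three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_error(scores, labels, 1)
top2 = top_k_error(scores, labels, 2)
```

In the toy example two of the three samples are wrong at top-1 and one is still wrong at top-2, which shows why the top-5 error rate is never higher than the top-1 error rate.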
Table 2. Error Rate (%) from Top-1 to Top-5 for Validation
The authors of [22] used a VGG classifier, which is compared with the proposed DCNN ResNet-164 model on the Extended Yale-B and CMU-PIE datasets. In the proposed work, the preprocessing model, together with ResNet, improves the recognition rate of the system. The anisotropy diffusion filter is first used to remove noise from the image and to improve quality by preserving edges. The preprocessed image gives better performance than the original image. Equation (7) is used to calculate the recognition rate.
The existing VGG approach [22] gave recognition rates of 65.69% and 95.91% on the original images of the Yale-B and CMU-PIE datasets. Our approach increased the recognition rate to 95.62% and 97.23% on the original images. On the preprocessed images, the VGG classifier [22] gave recognition rates of 85.86% and 96.98%, which our approach raised to 96.35% and 99.18%.
Table 3. Recognition Rate Based on Classifier and Preprocessing
Table 4 lists the classification accuracy obtained by several existing methods. In this work the training job is run for around 450 iterations on a Tesla V100 GPU with 16 GB of memory; a minimum of 8 GB of memory is sufficient for training. The CNN is used for feature extraction and produces an accuracy of 98.05% on the Face 94 dataset, 94.56% on the Face 95 dataset, 95.23% on the Face 96 dataset and 99.26% on the Grimace dataset, which is higher than the existing method proposed in [28] and the other approaches [31–33].
Table 4. Accuracy Based on Standard Dataset
Based on the different datasets used in this experiment, the following confusion matrix is constructed to analyze the performance of the fuzzy deep learning classifier. The confusion matrix is summarized by precision, recall and accuracy, which compare the actual outcome with the expected outcome. Accuracy, given by Equation (5), is the proportion of all face images that are classified correctly. Precision, given by Equation (8), is the proportion of images identified as correct that are actually correct. Recall, given by Equation (9), is the proportion of correct images that the system identifies as correct. True positive (tp) means that a correct face image is identified as correct. True negative (tn) means that an incorrect face image is identified as incorrect. False positive (fp) means that an incorrect face image is identified as correct, and false negative (fn) means that a correct face image is identified as incorrect. Figure 8 depicts the confusion matrix for the fuzzy deep learning approach.
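Equations (5), (8) and (9) are not reproduced in this text. Assuming the standard definitions, accuracy = (tp + tn) / (tp + tn + fp + fn), precision = tp / (tp + fp) and recall = tp / (tp + fn), the three measures can be computed from the confusion matrix counts as in the sketch below. The counts in the example are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative counts only.
accuracy, precision, recall = classification_metrics(tp=90, tn=85, fp=10, fn=15)
```

With these counts the accuracy is 0.875, the precision is 0.9 and the recall is about 0.857.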

Figure 8. Confusion matrix based on fuzzy deep learning.
The FERET dataset covers 856 faces and contains 2413 facial images under different poses. The extended Yale-B dataset has 16128 images of 28 individuals under 9 poses and 64 illumination conditions. In this work, the extended Yale-B dataset is used. For training, 252 images of 9 poses and the maximum illuminated images under 45° illumination conditions, resulting in 1260 images, are used; the other images are used for testing. From the JAFFE dataset, 251 images with 7 different expression variations are used for experimentation. The seven expression variations are happiness, sadness, fear, disgust, anger, contempt and surprise. In addition, 7500 images from the CMU-PIE dataset and 327 images representing expression variation from the CK+ dataset were used. Along with these, 2660 images from the Face 94, 95, 96 and Grimace datasets are used in the result comparison.
Conclusion and future work
This work presents a face recognition system based on a fuzzy deep convolution neural network. In the preprocessing stage, the fuzzy anisotropy diffusion filter removes noise to improve image clarity. The proposed model improves the accuracy rate without loss of information. Compared to existing techniques, the proposed work performs better with respect to recognition rate, accuracy, and top-1 and top-5 error rates. The extracted features are classified using a deep ANFIS classifier, whose results are better than those of existing machine learning approaches. A large dataset is trained and tested using the deep learning approach, which gives better results than existing works. As future work, better preprocessing and segmentation algorithms are needed for occluded and partially occluded images. New features can be formulated, and 3D images can also be trained and tested. A fuzzy recurrent neural network (FRNN) with a long short-term memory (LSTM) architecture will be used to reduce time and space requirements.
