Abstract
Face Expression Recognition (FER) has attracted considerable attention from computer vision researchers because of its usefulness in security, robotics, and HMI (Human-Machine Interaction) systems. We propose a concise CNN (Convolutional Neural Network) architecture for expression classification and evaluate it on the JAFFE dataset to show its effectiveness. The objective of our experiments is to achieve convincing performance while reducing computational overhead. The proposed CNN model is very compact compared to other state-of-the-art models. Without applying any pre-processing methods, we achieve a highest accuracy of 97.10% and an average accuracy of 90.43% over the 10 best runs, which supports the effectiveness of our model. Furthermore, we include visualizations of the CNN layers to observe what the CNN learns.
Introduction
Facial expressions are among the most general cues for interpreting a person's emotional state. For example, we can better guess a person's current mood by observing his/her facial expressions [1] than from a telephone conversation. Face Expression Recognition (FER) has vast applications in the fields of robotics [2] and security. Research in FER is driven mainly by the rapid evolution of, and need for, automated machines. HMI (Human-Machine Interaction) systems are also one of the most rapidly developing businesses [3]. All these applications require accurate recognition of facial expressions.
Traditionally, FER was performed by manually extracting features from images and giving them as input to a model. Some popular models are MLP (Multi-Layer Perceptron), SVM (Support Vector Machine), and kNN (k-Nearest Neighbors). These models require handcrafted features like LBP (Local Binary Patterns) [4], Eigenfaces [5], Gabor Features [6], and Face Landmark Features [7]. However, FER is now increasingly automated using CNN [8], DBN (Deep Belief Networks) [9], and RNN (Recurrent Neural Networks). CNNs have achieved great popularity in the field of computer vision [10] because of their accurate results compared to traditional methods. Moreover, we do not need to extract features from images manually because CNNs extract and learn those features automatically.
Basic CNN model for FER.
Researchers have applied CNNs to face expression recognition. However, CNNs have some disadvantages, such as higher computational cost and the need for a large dataset. We address both of these drawbacks in our proposed approach and arrive at a CNN architecture that is as concise as possible while still achieving comparable accuracy. To this end, we conduct experiments that vary parameters such as the number of convolutional, pooling, and fully connected layers; the kernel size; and the number of neurons in the fully connected layer. Our proposed CNN model comprises only 7 layers (3 convolutional, 3 pooling, and 1 fully connected); thus, the proposed architecture enables faster learning. We obtain promising performance with a minimal number of hyper-parameters, which leads to efficient computation and shorter training time. Specifically, our proposed CNN model obtains a highest accuracy of 97.10%.
This article is divided into five sections. Section 2 presents essential background and literature related to FER. The proposed CNN-based approach is described in Section 3. Section 4 presents the results of different experiments and a comparison of the obtained results with other research works. Conclusion and future work are presented in Section 5.
The evolution of automated systems has accelerated greatly in the last decade. Specifically, CNNs are becoming very popular for object detection and pattern recognition tasks. A CNN contains mainly three types of layers: convolutional, pooling, and fully connected. Figure 1 shows the basic structure of a CNN. The convolutional layer performs the convolution operation on the input data using a pre-defined number of kernels of a given size. These kernels create feature maps from the input data. ReLU (Rectified Linear Unit) activation is widely used with convolutional layers to exclude negative values from the generated feature maps. Convolution kernel size is selected in odd numbers (e.g., 3
In recent times, researchers have attempted face expression recognition using Convolutional Neural Networks. However, there also exist some interesting works based on the traditional feature-based approach. For example, Stathopoulou and Tsihrintzis [12] described various requirements of successful emotion recognition, which include face databases, feature engineering, and image preprocessing. Their work [12] also discussed how human motion and gesture can be useful in emotion recognition. Brahmbhatt et al. [7] worked on various features of human face images. A human face contains many features such as eyes, nose, lips, moustache, and beard, whose detection and extraction can be useful for face-related processing. A concise survey of methods for extracting such face features, together with their analysis, is available in [7].
In the feature-based approach, face detection is an important step before performing face expression recognition. The work in [13] proposed two neural networks of different complexities for face detection, in which one model works on the full face image whereas the second model focuses on specific characteristics of the face. Their work [13] also observed the effect of slightly rotating face images. Recently, Prajapati et al. [14] carried out work on extracting various face parts from frontal pose images of Indian people. Their work [14] first detected the face, then extracted it by building the face boundary, and then extracted various face parts such as eyes, nose, lips, moustache, and beard.
Widely taken datasets for FER
Steps of the proposed approach using CNN.
Researchers have also attempted emotion recognition by combining multiple different systems. For example, Stathopoulou et al. [15] presented an interesting work on emotion recognition using two systems: (i) emotion recognition based on visual face features and (ii) emotion recognition based on the keystrokes that users type under different emotion states. Their work [15] also discussed possibilities of combining the outcomes of these two systems. They used a feature-driven, neural network based architecture to recognize six emotions, namely happiness, sadness, surprise, anger, disgust, and neutral. Another work [16] took a bimodal approach and presented an empirical investigation on combining audio-lingual and visual-lingual modalities for affect recognition, that is, recognizing emotion states. Their work employed multi-criteria decision making theory to recognize six emotion states: happiness, sadness, surprise, anger, disgust, and normal.
Compared to the traditional feature-based approach, deep learning based models are relatively new. Although some researchers [17, 18, 19] have achieved promising results for the FER problem, their CNN architectures are computationally expensive. Ucar [18] proposed a CNN architecture which consists of 9 layers (5 convolutional, 3 pooling, and 1 fully connected). Similarly, Cai et al. [19] proposed a large model which consists of 11 layers (5 convolutional, 4 pooling, and 2 fully connected). Nwosu et al. [17] proposed a two-channel CNN approach in which the first channel extracts features of the eyes and the second channel extracts features of the mouth. Recently, some CNN-based models, e.g., [20, 21, 22] and [23], have been evaluated on the JAFFE dataset [24]. In this article, our objective is to propose a concise CNN architecture for face expression recognition; the model should be as simple as possible and still achieve accuracies comparable to these state-of-the-art CNN works. A discussion of these research works and a comparison of our proposed work with them are presented in Section 4.
The major contribution of this article is a concise CNN architecture to solve the problem of FER. For that, we conducted an intensive background study of this domain by surveying CNN-based research works proposed by various researchers. After observing the recent research works, we selected the JAFFE dataset [24], which is one of the widely used datasets for evaluating FER. We also performed data augmentation in order to get a sufficient number of images to train the model. We trained and tested the CNN model multiple times with different variants and observed the behavior of the resulting models. Figure 2 presents the proposed approach.
There are multiple widely used datasets available for evaluation and comparison purposes, from which we chose JAFFE dataset [24] to evaluate our model. Table 1 shows characteristics of most widely used datasets for FER.
Classification accuracy obtained on test set for different architectures of CNN
The JAFFE dataset contains 213 images; however, they are not sufficient to train a deep learning model like a CNN. To address this issue, we performed data augmentation to obtain an adequate number of images to train the model. A CNN requires various parameters to be tuned in order to get promising results. Those parameters are the following: the number of convolutional, pooling, and fully connected layers; the size of the convolutional and pooling kernels; the number of neurons in the fully connected layer; activation functions; learning algorithms; augmentation operations; etc. The CNN model in this article is derived from different experiments and an intensive survey presented in [28].
We have used the default image size of JAFFE dataset [24], which is 256
Data augmentation refers to the enlargement of a dataset by performing operations like rotation, shear, zoom, brightness change, rescaling, horizontal flip, vertical flip, ZCA whitening, etc. We required data augmentation because the JAFFE dataset contains only 213 images, which are not sufficient to train the CNN model. Therefore, we performed data augmentation by applying rotation, zoom, rescale, and horizontal flip operations to the facial images. We observed that batch size also affects the results. Hence, after performing various experiments, summarized in Table 2, the batch size was set to 32.
We have not applied any resizing because it may lead to some loss of data. We observed that a CNN can learn features more precisely if the input image is large enough for features to be extracted. The structural parameters of the CNN, such as the total number of convolutional layers, convolution filters, kernel size, pooling kernel size, number of fully connected layers, and neurons in fully connected layers, were selected after analyzing research works carried out by different researchers [30, 21, 18, 31, 32]. From this analysis, we concluded that researchers have used large convolutional kernel and pooling sizes when the image size is large. Similarly, larger images call for more fully connected layers with more neurons. We finalized the proposed CNN model for FER after the various experiments discussed in Section 4. The structure of the proposed CNN model is presented in Fig. 3. A detailed discussion on deriving the proposed structure is presented in Section 4. There are seven layers in total in the proposed CNN. The salient details of the architecture are as follows:
Input image size of 256
Convolutional layer with 8 kernels of size 9
MaxPool layer with pool size 4
Convolutional layer with 16 kernels of size 7
MaxPool layer with pool size 4
Convolutional layer with 32 kernels of size 5
MaxPool layer with pool size 2
Dropout with rate 0.2
Fully Connected layer with 256 neurons and ReLU activation
Dropout with rate 0.3
Output layer with 7 neurons and Softmax activation
Architecture of the proposed CNN model for Face Expression Recognition.
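The compactness of the listed architecture can be illustrated by tracing feature-map sizes and counting trainable parameters. The following is only a rough sketch under assumptions that the text does not state: a 256 x 256 single-channel input, square kernels, stride-1 convolutions without padding, and non-overlapping pooling. The function names are hypothetical, and different padding or stride choices would change the numbers.

```python
# Sketch: trace feature-map sizes and parameter counts for the listed layers.
# Assumptions (not stated in the text): 256 x 256 single-channel input,
# square kernels, stride-1 convolutions without padding, non-overlapping pooling.

def conv_out(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size, pool):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def summarize(input_size=256, channels=1):
    # (number of kernels, kernel size, pool size) for the three conv/pool pairs
    blocks = [(8, 9, 4), (16, 7, 4), (32, 5, 2)]
    size, params, rows = input_size, 0, []
    for filters, kernel, pool in blocks:
        params += kernel * kernel * channels * filters + filters  # weights + biases
        conv_size = conv_out(size, kernel)
        size = pool_out(conv_size, pool)
        rows.append((filters, conv_size, size))
        channels = filters
    flat = size * size * channels
    params += flat * 256 + 256  # fully connected layer with 256 neurons
    params += 256 * 7 + 7       # output layer with 7 expression classes
    return rows, flat, params


rows, flat, params = summarize()
for filters, conv_size, pooled in rows:
    print(f"{filters:>2} filters: conv {conv_size}x{conv_size} -> pool {pooled}x{pooled}")
print("flattened features:", flat)
print("trainable parameters:", params)
```

Under these assumptions the final pooled feature map is 5 x 5 with 32 channels, and most of the roughly 227 thousand parameters sit in the fully connected layer. Dropout layers add no parameters.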
As shown in Fig. 3, our proposed model consists of 3 convolutional layers, each followed by a pooling layer, 1 fully connected layer, and an output layer with Softmax activation. As stated earlier, we evaluated our model on the JAFFE dataset [24]. The model that obtained the highest accuracy took approximately 13 minutes to train. The highest accuracy of our proposed model is 97.10%. Figure 4 shows some samples of the JAFFE dataset. We split the entire dataset into train (144 images) and test (69 images) sets.
Sample images of JAFFE dataset [24].
Augmented image samples.
Intermediate output of different layers of the proposed CNN for FER.
As described in Section 3, we carried out data augmentation by performing rotate, zoom, and rescale operations on the facial images of the training data in order to get a sufficient number of training images. Other available data augmentation techniques are mentioned in Section 3. The rotation range was chosen as 40 and the zoom range as 0.3. Furthermore, a rescaling factor of 1/255 and horizontal flip were applied. These values were selected to generate realistic augmented images (i.e., real-life face images are unlikely to be rotated by more than 45 degrees). Figure 5 shows samples of augmented images. All scripts were written in Python v3.6 using the Spyder IDE and run on an NVIDIA GTX 1050Ti GPU and an Intel i7-8750H @2.20 GHz CPU. On this hardware, it took approximately 13 minutes to train the model.
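The augmentation settings above can be sketched as random draws from the stated ranges. This is a minimal illustration rather than the pipeline actually used: it assumes the ranges are symmetric (rotation in [-40, 40] degrees, zoom factor in [1 - 0.3, 1 + 0.3]), which the text does not spell out, and the helper names and the tiny example image are hypothetical.

```python
import random

# Sketch: draw one set of augmentation parameters from the stated ranges.
# Assumption: ranges are symmetric, i.e. rotation in [-40, 40] degrees and
# zoom factor in [1 - 0.3, 1 + 0.3]; the text does not state this explicitly.
ROTATION_RANGE = 40   # degrees
ZOOM_RANGE = 0.3
RESCALE = 1.0 / 255   # maps 8-bit pixel values to the range [0, 1]


def sample_augmentation(rng):
    """Return randomly drawn parameters for one augmented image."""
    return {
        "angle": rng.uniform(-ROTATION_RANGE, ROTATION_RANGE),
        "zoom": rng.uniform(1 - ZOOM_RANGE, 1 + ZOOM_RANGE),
        "flip": rng.random() < 0.5,
    }


def rescale_and_flip(image, flip):
    """Apply the rescale factor and, optionally, a horizontal flip.

    `image` is a list of rows of 8-bit grey values. Rotation and zoom need
    interpolation and are left to an image-processing library.
    """
    rows = [[pixel * RESCALE for pixel in row] for row in image]
    return [row[::-1] for row in rows] if flip else rows


rng = random.Random(0)
aug_params = sample_augmentation(rng)
tiny = [[0, 128, 255], [255, 128, 0]]  # hypothetical 2 x 3 image
augmented = rescale_and_flip(tiny, flip=True)
```

Keeping the rotation range below 45 degrees matches the stated goal of producing realistic face images.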
Figure 6 shows intermediate output of different layers of the proposed CNN for FER. Convolutional layers with corresponding pooling layers are presented. The figure is provided to show how CNN model automatically learns facial features.
Default image size of 256
We trained and tested the above configuration multiple times; the 10 best runs are described in Table 4. The highest accuracy we obtained is 97.10%, and the average accuracy of the 10 best runs is 90.43%. Figure 7 shows the confusion matrix generated from the test data for the best case. We can clearly see that only 2 of the 69 test images are incorrectly classified.
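The best-case figure follows directly from the counts given above, as this short check shows (the function name is hypothetical):

```python
# Sketch: reproduce the reported best-case accuracy from the counts in the text.
def accuracy_percent(total, misclassified):
    """Percentage of correctly classified test images."""
    return 100.0 * (total - misclassified) / total


best = accuracy_percent(total=69, misclassified=2)
print(f"{best:.2f}%")  # 67 of 69 correct -> 97.10%
```

With only 69 test images, each misclassified image changes the accuracy by about 1.45 percentage points.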
Confusion matrix of highest result obtained.
Misclassified images of JAFFE dataset.
Figure 8 shows the misclassified images. We observed that the JAFFE dataset contains images with very little inter-class variation. The model was not able to correctly classify two images of the classes ‘Neutral’ and ‘Sadness’ because they differ very little from each other. Even a human observer cannot easily determine which image belongs to which class.
We started experimenting by considering 4 convolutional layers followed by 4 pooling layers. Kernel size of convolutional layers and pooling layers were taken ranging from 3
Different hyper-parameters also affect the result of the model. Table 2 shows the classification behavior under different hyper-parameters. We observed that batch size also affects the training. Steps per epoch refers to the number of images taken randomly from the augmented data for training in each epoch. The value of steps per epoch is taken as 144 because, in practice, only 5 images are not sufficient for every epoch.
Steps per epoch plays a major role in training. A detailed comparison of how steps per epoch affects the training is shown in Tables 3 and 4. The SGD (Stochastic Gradient Descent) optimizer was also tested, but as shown in Table 2, Adam [33] performs better than SGD. We carried out experiments by varying the value of steps per epoch; Table 3 shows the results obtained for the various values.
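One possible reading of the figure of 5 is sketched below. It assumes that a step corresponds to one batch drawn from the augmentation generator, as in common generator-based training loops; the text itself describes steps in terms of images, so this interpretation is an assumption, and the constant names are hypothetical.

```python
import math

TRAIN_IMAGES = 144
BATCH_SIZE = 32
STEPS_PER_EPOCH = 144

# Number of batches needed for one pass over the original training images.
default_steps = math.ceil(TRAIN_IMAGES / BATCH_SIZE)

# Assumption: each step draws one batch of freshly augmented images.
images_per_epoch = STEPS_PER_EPOCH * BATCH_SIZE

print(default_steps)     # 5
print(images_per_epoch)  # 4608
```

Under this reading, raising steps per epoch from 5 to 144 lets the network see far more augmented variants of the 144 training images in each epoch.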
Results of 5 runs on different values of steps per epoch
After conducting the experiments discussed in Table 2, we found that around 150 steps per epoch are sufficient to train the model efficiently. Detailed information on the top 10 runs with 144 steps per epoch is given in Table 4. As described, we obtained a highest accuracy of 97.10%.
Results of best 10 runs
Comparison of the proposed work with other state-of-the-art work
As discussed earlier, we performed data augmentation on the training images of the JAFFE dataset and experimented with the model by varying different hyper-parameters. After trying various combinations, we obtained 97.10% accuracy in the best case. Table 5 compares the proposed method with state-of-the-art methods evaluated on the JAFFE dataset. All the compared works are evaluated on the JAFFE dataset on the same 7 expressions: Anger, Fear, Happiness, Disgust, Surprise, Sadness, and Neutral. Moreover, all the works are based on CNNs.
Shan et al. [23] proposed a CNN model which consists of only two convolutional layers. As recognition of facial expressions is a moderately complex task, a CNN requires more layers in order to achieve better results. Similarly, a CNN requires more neurons in the fully connected layer to accommodate all the connections coming from the previous layer. Xie et al. [22] proposed a model which consists of one fully connected layer with only 64 neurons. Lopes et al. [21] used a large convolutional kernel size despite having a small input image. Chen et al. [20] proposed a CNN architecture with a total of 4 convolutional layers and 2 pooling layers, in which every 2 consecutive convolutional layers are followed by 1 pooling layer. All convolutional layers use a high number of convolution filters (i.e., 96, 128, 256, 256). Cai et al. [19] and Ucar [18] obtained accuracies of 95.24% and 96.10%, respectively, but both of their proposed CNN architectures are computationally expensive. Similarly, Nwosu et al. [17] proposed a computationally expensive approach, which consists of 2 individual CNNs to extract features of the mouth and eyes.
Our proposed model is a balanced architecture with accurately defined convolutional layers, convolution filters and kernel size, pooling kernel size, number of fully connected layers, and neurons. Our proposed concise model clearly outperforms all other state-of-the-art CNN models in terms of complexity of the architecture for the problem of face expression recognition.
Further research is possible in the following directions: image quality and side face images. The quality of face image can affect the performance of face expression recognition. About image quality of face images, two issues can be explored: (1) image resolution and (2) presence of noise. The image resolution can be experimentally explored by changing the dimensions of the first layer of the CNN. If some noise is present in the images, which is a possible scenario, the images can be passed through an image processing pipeline. The dataset of our study, i.e., JAFFE dataset, did not require any noise removal technique. However, noisy images can be prepared by adding some noise that follows some distribution and then images can be filtered before they are fed to CNN.
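The noise experiment suggested above could be prototyped as follows. This is only a sketch under stated assumptions: the noise is taken to be zero-mean Gaussian and the filter a 3 x 3 median filter, neither of which is prescribed by the text, and the function names and the flat grey test patch are hypothetical.

```python
import random
import statistics


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the 8-bit range."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]


def median_filter(image):
    """3 x 3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [image[i][j]
                      for i in range(max(0, r - 1), min(h, r + 2))
                      for j in range(max(0, c - 1), min(w, c + 2))]
            row.append(statistics.median(window))
        out.append(row)
    return out


rng = random.Random(0)
clean = [[100] * 5 for _ in range(5)]  # hypothetical flat grey patch
noisy = add_gaussian_noise(clean, sigma=20, rng=rng)
filtered = median_filter(noisy)        # would be fed to the CNN instead of `noisy`
```

Comparing recognition accuracy on `noisy` and `filtered` versions of the test images would indicate how much a simple filtering step helps.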
In this article, we studied face expression recognition from frontal face images. However, side images may provide useful features that are not apparent from frontal images, and such features can be used as additional information. If we consider only side faces, we may lose essential information on face expression. Therefore, a face expression recognition model can be studied that combines the classifications of multiple independent models working on separate images, i.e., frontal face images and side images. This requires a dataset that contains images of face expressions captured by multiple cameras from different angles. The combination of multiple models can be explored using ensemble learning techniques.
In this article, we have proposed a CNN architecture for facial expression recognition. We have given an overview of the CNN architecture, which can help beginners understand CNNs. The major challenge was to suggest an appropriate CNN architecture which is cost effective and which can perform comparably to or better than state-of-the-art CNN models proposed by researchers. We resolved this challenge by performing data augmentation on the training data and experimenting with various combinations of CNN architectures. Another challenge was to classify test images with higher accuracy. Therefore, the model we created had to be well structured with optimized hyper-parameters.
We used the JAFFE dataset to perform various experiments related to our proposed work. We compared our proposed model with other state-of-the-art models, and the results clearly show that our proposed model is the most concise among them. We obtained a highest accuracy of 97.10% and an average accuracy of 90.43% over the 10 best runs of our model. We have also reported the behavior of the CNN model under different hyper-parameters. Visualizations of intermediate CNN layers are also included.
Future experiments can be carried out to obtain the same result with a smaller image size, which would reduce training time. Basic pre-processing techniques like cropping can be applied. Moreover, increasing the number of test images by including augmented data, cross-database evaluation, and transfer learning can also be future work. Further research directions are also provided, which include the effect of image quality on the recognition model and the use of side images for expression recognition.
