Concise CNN model for face expression recognition

Abstract

Face Expression Recognition (FER) has attracted considerable attention from computer vision researchers because of its usefulness in security, robotics, and HMI (Human-Machine Interaction) systems. We propose a concise CNN (Convolutional Neural Network) architecture for expression classification and evaluate it on the JAFFE dataset to show its effectiveness. The objective of our experiments is to achieve convincing performance while reducing computational overhead. The proposed CNN model is very compact compared to other state-of-the-art models. We achieve a highest accuracy of 97.10% and an average accuracy of 90.43% over the 10 best runs without applying any pre-processing methods, which supports the effectiveness of our model. Furthermore, we include visualizations of the CNN layers to observe what the CNN learns.

Keywords

FER, face expression recognition, CNN, JAFFE dataset, deep learning, convolutional neural network

1. Introduction

Facial expressions are among the most general cues for interpreting a person’s emotional state. For example, we can guess a person’s current mood better by observing his/her facial expressions [1] than from a telephone conversation. Face Expression Recognition (FER) has vast applications in the fields of robotics [2] and security. The prime driver of research in FER is the rapid evolution of, and need for, automated machines. HMI (Human-Machine Interaction) systems are also one of the most rapidly developing businesses [3]. All these applications require accurate recognition of facial expressions.

Traditionally, FER was performed by manually extracting features from images and giving these features as input to a model. Some popular models are MLP (Multi-Layer Perceptron), SVM (Support Vector Machine), and kNN (k-Nearest Neighbors). These models require handcrafted features such as LBP (Local Binary Patterns) [4], Eigenfaces [5], Gabor features [6], and face landmark features [7]. Nowadays, however, FER is increasingly automated using CNN [8], DBN (Deep Belief Networks) [9], and RNN (Recurrent Neural Networks). CNNs have become very popular in the field of computer vision [10] because of their accurate results compared to traditional methods. Moreover, features need not be extracted from images by hand because CNNs extract and learn them automatically.

Figure 1.

Basic CNN model for FER.

Researchers have applied CNNs to face expression recognition. However, CNNs have some disadvantages, such as high computation cost and the need for a large dataset. We address both of these drawbacks in our proposed approach and arrive at a CNN architecture that is as concise as possible and can still achieve comparable accuracy. To achieve this, we conduct experiments that vary parameters such as the number of convolutional, pooling, and fully connected layers; the kernel size; and the number of neurons in the fully connected layer. Our proposed CNN model comprises only 7 layers (3 convolutional, 3 pooling, and 1 fully connected); thus, the proposed architecture enables faster learning. We obtain promising performance with a minimal set of hyper-parameters, which reduces computation and training time. Specifically, our proposed CNN model obtains a highest accuracy of 97.10%.

This article is divided into five sections. Section 2 presents essential background and literature related to FER. The proposed CNN-based approach is described in Section 3. Section 4 presents the results of different experiments and compares them with other research works. Conclusions and future work are presented in Section 5.

2. Background and related work

The use of automated systems has grown greatly in the last decade. In particular, CNNs are becoming very popular for object detection and pattern recognition tasks. A CNN contains mainly three types of layers: convolutional, pooling, and fully connected. Figure 1 shows the basic structure of a CNN. The convolutional layer performs the convolution operation on the input data using a pre-defined number of kernels of a given size. These kernels create feature maps from the input data. ReLU (Rectified Linear Unit) activation is widely used with the convolutional layer to set negative values of the generated feature maps to zero. In practice, the convolution kernel size is chosen as an odd number (e.g., 3 $\times$ 3, 5 $\times$ 5, 7 $\times$ 7). Then, a pooling layer is used to reduce the dimensions of the feature maps generated by the convolutional layer. Mostly, maximum pooling (MaxPool) is used, which keeps only the strongest response in each window. In practice, the pooling kernel size is chosen as an even number (e.g., 2 $\times$ 2, 4 $\times$ 4). Finally, all feature maps are flattened and fed into the fully connected layer. This layer behaves just like a traditional multi-layer neural network: it multiplies the outputs of the previous layer by its weight matrix and adds a bias for each neuron. At last, the output layer has one neuron per class. CNNs need a large number of images for training; with too few images, the model tends to overfit. To reduce overfitting, batch normalization and Dropout [11] are used.
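The layer operations described above can be illustrated with a minimal pure-Python sketch. It is not taken from the paper's implementation: the function names, the 4 $\times$ 4 feature map, and the weights are hypothetical values chosen only to show ReLU, non-overlapping max pooling, and a fully connected layer computing a weighted sum plus bias.

```python
# Illustrative sketch only: tiny pure-Python versions of the layer operations
# described in the text. All names and numbers here are hypothetical.

def relu(feature_map):
    """Replace negative values in a 2-D feature map with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; assumes height and width are multiples of size."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]


def dense(inputs, weights, biases):
    """Fully connected layer: weighted sum of the inputs plus a bias, per neuron."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


feature_map = [
    [1.0, -2.0, 3.0, 0.5],
    [-1.0, 4.0, -3.0, 2.0],
    [0.0, -5.0, 6.0, -1.0],
    [2.0, 1.0, -4.0, 7.0],
]
pooled = max_pool(relu(feature_map), size=2)   # 4x4 map reduced to 2x2
flattened = [v for row in pooled for v in row]  # flatten before the dense layer
scores = dense(flattened, weights=[[0.1, 0.2, 0.3, 0.4]], biases=[0.5])
print(pooled, scores)
```

A real CNN learns the kernel and dense-layer weights during training; here they are fixed by hand so that each step can be followed on paper.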

In recent times, researchers have attempted face expression recognition using CNNs. However, there are also interesting works based on the traditional feature-based approach. For example, Stathopoulou and Tsihrintzis [12] described various requirements of successful emotion recognition, including face databases, feature engineering, and image preprocessing. Their work [12] also discussed how human motion and gesture can be useful in emotion recognition. Brahmbhatt et al. [7] studied various features of human face images. A human face contains many features such as eyes, nose, lips, moustache, and beard, whose detection and extraction can be useful for face-related processing. Various methods for extracting such human face features, together with their analysis, are discussed in a concise survey in [7].

In the feature-based approach, face detection is an important step before performing face expression recognition. The work in [13] proposed two neural networks of different complexities for face detection, in which one model works on the full face image whereas the second model focuses on specific characteristics of the face. Their work [13] also observed the effect of slightly rotating face images. Recently, Prajapati et al. [14] worked on extracting various face parts from frontal pose images of Indian people. Their work [14] first detected the face, then extracted the face by building the face boundary, and then extracted various face parts such as eyes, nose, lips, moustache, and beard.

Table 1
Widely taken datasets for FER

Dataset	Facial expressions	# of subjects	# of images/videos	Color/gray	Resolution	Type
CK+ [25]	8 (neutral, sadness, surprise, happiness, fear, anger, contempt, and disgust)	123	593 images	Mostly gray	640 $\times$ 490	Posed, spontaneous
JAFFE [24]	7 (neutral, sadness, surprise, happiness, fear, anger, and disgust)	10	213 images	Gray	256 $\times$ 256	Posed
FER-13 [26]	7 (happy, disgusted, fearful, angry, sad, surprised, and neutral)	N/A	35887 images	Gray	48 $\times$ 48	Spontaneous
Oulu-CASIA [27]	6 (surprise, happiness, sadness, anger, fear, and disgust)	80	480 images, 2880 videos	Color	320 $\times$ 240	Posed

Figure 2.

Steps of the proposed approach using CNN.

Researchers have also attempted emotion recognition by combining multiple systems. For example, Stathopoulou et al. in [15] presented an interesting work on emotion recognition using two systems: (i) emotion recognition based on visual face features and (ii) emotion recognition based on the keystrokes that users type under different emotional states. Their work [15] also discussed possibilities of combining the outcomes of these two systems. They used a feature-driven, neural network based architecture to recognize six emotions, namely happiness, sadness, surprise, anger, disgust, and neutral. Another work [16] took a bimodal approach and presented empirical investigations on combining audio-lingual and visual-facial modalities for affect recognition, that is, recognizing emotional states. Their work employed multi-criteria decision making theory to recognize six emotional states: happiness, sadness, surprise, anger, disgust, and normal.

Compared to the traditional feature-based approach, deep learning based models are relatively new. Some researchers [17, 18, 19] have achieved promising results on the FER problem, but their CNN architectures are computationally expensive. Ucar [18] proposed a CNN architecture which consists of 9 layers (5 convolutional, 3 pooling, and 1 fully connected). Similarly, Cai et al. [19] proposed a large model which consists of 11 layers (5 convolutional, 4 pooling, and 2 fully connected). Nwosu et al. [17] proposed a two-channel CNN in which the first channel extracts features of the eyes and the second channel extracts features of the mouth. Recently, some CNN-based models, e.g., [20, 21, 22, 23], have been evaluated on the JAFFE dataset [24]. In this article, our objective is to propose a concise CNN architecture for face expression recognition; the model should be as simple as possible and should still achieve accuracy comparable to these state-of-the-art CNN works. These research works are discussed and compared with our proposed work in Section 4.

3. Proposed approach

The major contribution of this article is a concise CNN architecture for the FER problem. For that, we conducted an extensive background study of this domain by surveying different CNN-based research works. After reviewing the recent research works, we selected the JAFFE dataset [24], which is one of the widely used datasets for evaluating FER. We also performed data augmentation in order to get a sufficient number of images to train the model. We trained and tested the CNN model multiple times with different variants and observed the behavior of the CNN models. Figure 2 presents the proposed approach.

Multiple widely used datasets are available for evaluation and comparison purposes, from which we chose the JAFFE dataset [24] to evaluate our model. Table 1 shows the characteristics of the most widely used datasets for FER.

Table 2
Classification accuracy obtained on test set for different architectures of CNN

Approximate classification accuracy	CNN architecture [layer name, kernel size, # of kernel]	Batch size	Optimizer
97%	[Conv1 ${}^{\rm a}$, (9,9), 8], [MaxPool (4,4)], [Conv2, (7,7), 16], [MaxPool (4,4)], [Conv3, (5,5), 32], [MaxPool (2,2)], [Dropout (0.2)], [FC ${}^{\rm b}$ (256)], [Dropout (0.3)], [Output]	32	Adam
87%	[Conv1, (9,9), 8], [MaxPool (4,4)], [Conv2, (7,7), 16], [MaxPool (4,4)], [Conv3, (5,5), 32], [MaxPool (2,2)], [Dropout (0.2)], [FC (256)], [Dropout (0.3)], [Output]	32	SGD ${}^{\rm c}$
93%	[Conv1, (9,9), 8], [MaxPool (4,4)], [Conv2, (7,7), 16], [MaxPool (4,4)], [Conv3, (5,5), 32], [MaxPool (2,2)], [Dropout (0.2)], [FC (256)], [Dropout (0.3)], [Output]	64	Adam
59%	[Conv1, (9,9), 8], [MaxPool (4,4)], [Conv2, (7,7), 16], [MaxPool (4,4)], [Conv3, (5,5), 32], [MaxPool (2,2)], [Dropout (0.2)], [FC (256)], [Dropout (0.3)], [Output]	64	SGD
91%	[Conv1, (9,9), 16], [MaxPool (4,4)], [Conv2, (7,7), 32], [MaxPool (4,4)], [Conv3, (5,5), 64], [MaxPool (2,2)], [Dropout (0.2)], [FC (128)], [Dropout (0.3)], [Output]	32	Adam
84%	[Conv1, (9,9), 8], [MaxPool (4,4)], [Conv2, (7,7), 16], [MaxPool (4,4)], [Conv3, (5,5), 32], [MaxPool (2,2)], [Dropout (0.2)], [FC1 (128)], [Dropout (0.2)], [FC2 (256)], [Dropout (0.2)], [Output]	32	Adam
78%	[Conv1, (9,9), 8], [MaxPool (4,4)], [Conv2, (7,7), 16], [MaxPool (4,4)], [Conv3, (5,5), 32], [MaxPool (2,2)], [Conv4, (3,3), 32], [MaxPool (2,2)], [Dropout (0.2)], [FC (256)], [Dropout (0.3)], [Output]	32	Adam

${}^{\rm a}$ Conv-Convolutional layer, ${}^{\rm b}$ FC-Fully connected layer, ${}^{\rm c}$ SGD-Stochastic Gradient Descent. Note: 1) Steps per epoch are 144. 2) ReLU activation is applied after every convolutional layer. 3) Softmax activation is used in output layer. 4) SGD learning rate was taken as 0.01 with no Nesterov. 5) Zero padding is applied at every convolutional and pooling layer.

The JAFFE dataset contains 213 images; however, these are not sufficient to train a deep learning model like a CNN. To address this issue, we performed data augmentation to obtain a sufficient number of images to train the model. A CNN requires various parameters to be tuned in order to get a promising result. These parameters are the following: the number of convolutional, pooling, and fully connected layers; the size of the convolutional and pooling kernels; the number of neurons in the fully connected layer; activation functions; learning algorithms; augmentation operations; etc. The CNN model in this article is derived from different experiments and an extensive survey presented in [28].

We have used the default image size of the JAFFE dataset [24], which is 256 $\times$ 256. JAFFE is a very popular dataset in the field of FER, and many researchers, including [17, 18, 19, 20, 21, 22, 23], have evaluated and compared their research work on it. The dataset consists of 213 images of 10 Japanese women. Shaver et al. [29] concluded that only 7 expressions (Anger, Fear, Happiness, Disgust, Surprise, Sadness, and Neutral) are prototype expressions. The JAFFE dataset contains images of all 7 of these basic expressions.

Data augmentation refers to enlarging a dataset by performing operations such as rotation, shear, zoom, brightness adjustment, rescaling, horizontal flip, vertical flip, ZCA whitening, etc. We required data augmentation because the JAFFE dataset contains only 213 images, which are not sufficient to train the CNN model. Therefore, we performed data augmentation by applying rotation, zoom, rescale, and horizontal flip operations to the facial images. We observed that batch size also affects the results. Hence, after performing various experiments, summarized in Table 2, the batch size was set to 32.

We have not applied any resizing because it may lead to some data loss. We observed that a CNN can learn features more precisely if the input image is large enough for features to be extracted. The structural parameters of the CNN, such as the total number of convolutional layers, convolution filters, kernel size, pooling kernel size, number of fully connected layers, and neurons in fully connected layers, were selected after analyzing research works carried out by different researchers [18, 21, 30, 31, 32]. From this analysis, we concluded that researchers use larger convolutional kernel and pooling sizes when the image size is large. Similarly, larger images call for more fully connected layers with a higher number of neurons. We finalized the proposed CNN model for FER after the various experiments discussed in Section 4, where the derivation of the proposed structure is presented in detail. The structure of the proposed CNN model is presented in Fig. 3. There are seven layers in total in the proposed CNN. Salient details of the architecture are as follows:

1. Input image size of 256 $\times$ 256 pixels.

2. Convolutional layer with 8 kernels of size 9 $\times$ 9 with zero padding and ReLU activation.

3. MaxPool layer with pool size 4 $\times$ 4 with zero padding.

4. Convolutional layer with 16 kernels of size 7 $\times$ 7 with zero padding and ReLU activation.

5. MaxPool layer with pool size 4 $\times$ 4 with zero padding.

6. Convolutional layer with 32 kernels of size 5 $\times$ 5 with zero padding and ReLU activation.

7. MaxPool layer with pool size 2 $\times$ 2 with zero padding.

8. Dropout with rate 0.2.

9. Fully Connected layer with 256 neurons and ReLU activation.

10. Dropout with rate 0.3.

11. Output layer with 7 neurons and Softmax activation.
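To give a feel for how compact the listed architecture is, the following sketch traces feature-map sizes and counts trainable parameters layer by layer. Several details are assumptions not stated in the text: a single grayscale input channel, convolution stride 1, pooling stride equal to the pool size, and "zero padding" meaning the output size is the input size divided by the stride, rounded up. The helper name `summarize` is hypothetical, and Dropout layers are omitted because they have no parameters.

```python
# Sketch under stated assumptions (not from the paper): one grayscale input
# channel, convolution stride 1, pooling stride equal to the pool size, and
# "zero padding" meaning output size = ceil(input size / stride).
import math

# (kind, kernel or pool size, number of kernels or neurons)
LAYERS = [
    ("conv", 9, 8),
    ("pool", 4, None),
    ("conv", 7, 16),
    ("pool", 4, None),
    ("conv", 5, 32),
    ("pool", 2, None),
    ("fc", None, 256),
    ("output", None, 7),
]


def summarize(input_size=256, input_channels=1):
    """Return per-layer (kind, output shape, parameters) and the parameter total."""
    size, channels = input_size, input_channels
    features = None  # length of the flattened vector, set at the first dense layer
    total_params = 0
    rows = []
    for kind, kernel, units in LAYERS:
        if kind == "conv":
            params = kernel * kernel * channels * units + units  # weights + biases
            channels = units
            shape = (size, size, channels)
        elif kind == "pool":
            params = 0
            size = math.ceil(size / kernel)
            shape = (size, size, channels)
        else:  # fully connected or output layer
            if features is None:
                features = size * size * channels  # flatten the last feature maps
            params = features * units + units
            features = units
            shape = (units,)
        total_params += params
        rows.append((kind, shape, params))
    return rows, total_params


rows, total_params = summarize()
for kind, shape, params in rows:
    print(f"{kind:<7} output={shape} params={params}")
print("total trainable parameters:", total_params)
```

Under these assumptions the last pooling layer yields 8 $\times$ 8 $\times$ 32 = 2048 values, so the single fully connected layer accounts for most of the parameters; a different padding or stride convention would change these numbers.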

Figure 3.

Architecture of the proposed CNN model for Face Expression Recognition.

4. Experiments, results, and discussion

As shown in Fig. 3, our proposed model consists of 3 convolutional layers, each followed by a pooling layer, 1 fully connected layer, and an output layer with Softmax activation. As stated earlier, we evaluated our model on the JAFFE dataset [24]. The proposed model, which obtained the highest accuracy, took approximately 13 minutes to train. The highest accuracy of our proposed model is 97.10%. Figure 4 shows some samples from the JAFFE dataset. We split the entire dataset into training (144 images) and test (69 images) sets.

Figure 4.

Sample images of JAFFE dataset [24].

Figure 5.

Augmented image samples.

Figure 6.

Intermediate output of different layers of the proposed CNN for FER.

As described in Section 3, we carried out data augmentation by performing rotation, zoom, rescale, and horizontal flip operations on the facial images of the training data in order to get a sufficient number of training images. Other available data augmentation techniques are mentioned in Section 3. The rotation range was chosen as 40 and the zoom range as 0.3. Furthermore, a rescaling factor of 1/255 and horizontal flip were applied. These values were selected to generate realistic augmented images (i.e., real-life face images are not expected to be rotated by more than 45 degrees). Figure 5 shows samples of augmented images. All scripts were written in Python v3.6 using the Spyder IDE and run on an NVIDIA GTX 1050Ti GPU and an Intel i7-8750H @2.20 GHz CPU. On this hardware, it took approximately 13 minutes to train the model.
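The augmentation settings above can be read as ranges from which one random transformation is drawn per image. The sketch below samples such parameters with the standard library only. Interpreting the rotation range as the interval [-40, 40] degrees and the zoom range as scale factors in [0.7, 1.3] is an assumption (it mirrors a common convention in image-augmentation libraries), and the function names are hypothetical; the text does not specify which library was used.

```python
# Sketch only: samples one set of augmentation parameters per image.
# Treating the ranges as symmetric intervals is an assumption, not a
# detail confirmed by the text.
import random

ROTATION_RANGE = 40    # degrees, value from the text
ZOOM_RANGE = 0.3       # value from the text
RESCALE = 1.0 / 255.0  # value from the text


def sample_augmentation(rng):
    """Draw a random rotation, zoom factor, and flip decision."""
    return {
        "rotation_deg": rng.uniform(-ROTATION_RANGE, ROTATION_RANGE),
        "zoom_factor": rng.uniform(1.0 - ZOOM_RANGE, 1.0 + ZOOM_RANGE),
        "horizontal_flip": rng.random() < 0.5,
    }


def rescale_pixels(pixels):
    """Map 8-bit gray levels (0 to 255) into the range 0 to 1."""
    return [p * RESCALE for p in pixels]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_augmentation(rng) for _ in range(5)]
for s in samples:
    print(s)
print(rescale_pixels([0, 128, 255]))
```

Applying the sampled rotation and zoom to actual pixel data requires an image library and is left out of this sketch.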

Figure 6 shows intermediate output of different layers of the proposed CNN for FER. Convolutional layers with corresponding pooling layers are presented. The figure is provided to show how CNN model automatically learns facial features.

The default image size of 256 $\times$ 256 was used in the evaluation. No resizing was applied, so that the CNN could learn from full-resolution images without data loss. The model was trained for 50 epochs. The Adam optimizer [33] was used with a learning rate of 0.001 throughout the experiments because Adam performs better than gradient descent [34] in many machine learning applications such as [35, 36, 37]. The Dropout method [11], which randomly drops units and their connections during training, is also used to reduce overfitting. Many researchers, such as [21, 22, 38], have found dropout very useful in their work. The Adam learning rate and the number of epochs were selected after observing the training loss.
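For readers unfamiliar with Adam [33], the following sketch shows a single-parameter version of its update rule with the learning rate of 0.001 used here. The decay rates 0.9 and 0.999 and the constant 1e-8 are the defaults suggested in [33]; whether the experiments kept those defaults is an assumption. The quadratic loss and the function name `adam_step` are hypothetical and serve only as a toy example.

```python
# Sketch of the Adam update rule for one scalar parameter.
# beta1, beta2, and eps are the defaults from the Adam paper; the toy loss
# f(theta) = theta**2 is purely illustrative.
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updated parameter and moment estimates after step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2.0 * theta  # gradient of theta**2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)
```

Because the update is normalized by the gradient magnitude, each step moves the parameter by roughly the learning rate, which is one reason the learning rate and the number of epochs have to be chosen together.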

We trained and tested the above configuration multiple times; the 10 best runs are listed in Table 4. The highest accuracy we obtained is 97.10%, and the average accuracy of the best 10 runs is 90.43%. Figure 7 shows the confusion matrix generated from the test data for the best case. Only 2 of the 69 images in the test set are incorrectly classified.
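As a quick check of the reported best case, 2 misclassified images out of 69 test images leave 67 correct predictions:

$\displaystyle\text{Accuracy}=\frac{69-2}{69}=\frac{67}{69}\approx 97.10\%$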

Figure 7.

Confusion matrix of highest result obtained.

Figure 8.

Misclassified images of JAFFE dataset.

Figure 8 shows the misclassified images. We observed that the JAFFE dataset contains images with very little inter-class variation. The model was not able to correctly classify two images of the classes ‘Neutral’ and ‘Sadness’, as those two images differ very little between the classes. Even human eyes cannot easily determine which image belongs to which class.

We started experimenting with 4 convolutional layers followed by 4 pooling layers. Kernel sizes ranged from 3 $\times$ 3 to 11 $\times$ 11 for the convolutional layers and from 2 $\times$ 2 to 4 $\times$ 4 for the pooling layers. Then, we reduced the number of convolutional layers and tuned the other hyper-parameters. A detailed comparison of different architectural hyper-parameters is given in Table 2. We tested 1 to 2 fully connected layers with 128 to 1024 neurons. After these experiments, we found that the CNN requires only 3 convolutional layers followed by 3 pooling layers for the problem of frontal face expression recognition. The dropout rate was chosen after experimenting with rates ranging from 0.1 to 0.4. The Softmax activation function was used at the output layer because it normalizes the input vector into a probability distribution, so that the model can predict the output based on the probability of each class. We also measured test accuracy with just 2 convolutional layers, but it did not perform well. Therefore, we finalized our CNN architecture as described in Section 3.
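The role of Softmax at the output layer can be shown with a short sketch. The seven raw scores below are made-up numbers for illustration, not outputs of the trained model, and the ordering of the class labels is an assumption.

```python
# Sketch of Softmax over the 7 expression classes.
# The logits are invented for illustration only.
import math

EXPRESSIONS = ["Anger", "Fear", "Happiness", "Disgust", "Surprise", "Sadness", "Neutral"]


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, -1.0, 3.1, 0.0, 0.5, 1.2, 2.4]
probs = softmax(logits)
predicted = EXPRESSIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The predicted class is simply the one with the highest probability, which is how the 7-neuron output layer is turned into a single expression label.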

Different hyper-parameters also affect the result of the model. Table 2 shows the classification behavior for different hyper-parameters. We observed that batch size also affects the training. Steps per epoch refers to the number of batches drawn randomly from the augmented data for training in each epoch. Computed by the formula below, 144 training images with a batch size of 32 give only about 5 steps per epoch, which is not sufficient in practice; we therefore set the value of steps per epoch to 144.

$\displaystyle\text{Steps per epoch}=\frac{\text{Total training images}}{\text{Batch size}}$
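Plugging the numbers from this article into the formula gives a concrete picture. In the sketch below, rounding up to a whole number of batches and counting one step as one batch of augmented images are assumptions; the function name is hypothetical.

```python
# Sketch: steps per epoch from the formula, using values from the text.
# Rounding up and treating one step as one batch are assumptions.
import math


def steps_per_epoch(total_training_images, batch_size):
    """Number of batches needed to pass over the training images once."""
    return math.ceil(total_training_images / batch_size)


formula_steps = steps_per_epoch(144, 32)  # 144 / 32 = 4.5, rounded up
chosen_steps = 144                        # value used in the experiments
augmented_images_per_epoch = chosen_steps * 32
print(formula_steps, augmented_images_per_epoch)
```

With 144 steps of 32 images each, every epoch draws far more augmented samples than the 144 original training images, which is what makes the small dataset usable for training.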

Steps per epoch plays a major role in training. A detailed comparison of how steps per epoch affects the training is shown in Tables 3 and 4. The SGD (Stochastic Gradient Descent) optimizer was also tested, but as shown in Table 2, Adam [33] performs better than SGD. We carried out experiments by varying the value of steps per epoch. Table 3 shows the experimental results obtained for various values of steps per epoch.

Table 3

Results of 5 runs on different values of steps per epoch

Epochs	Steps per epoch	Test accuracy	Average test accuracy	Time taken per one complete run
50	100	75.36%	85.21%	$\sim$ 9 minutes
		91.30%
		73.91%
		95.65%
		89.85%
	200	86.95%	85.48%	$\sim$ 17 minutes
		79.57%
		81.16%
		86.96%
		92.75%
	300	92.75%	81.34%	$\sim$ 26 minutes
		86.96%
		86.96%
		76.26%
		63.77%

After conducting the experiments summarized in Table 3, we found that around 150 steps per epoch are sufficient to train the model efficiently. Detailed information on the top 10 runs with 144 steps per epoch is given in Table 4. As described, we obtained a highest accuracy of 97.10%.

Table 4

Results of best 10 runs

Epochs	Steps per epoch	Test accuracy	Average test accuracy	Time taken per one complete run
50	144	85.51%	90.43%	$\sim$ 13 minutes
		89.85%
		94.20%
		89.85%
		92.75%
		88.41%
		97.10%
		86.96%
		92.75%
		86.96%
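The summary figures for Table 4 can be reproduced directly from the ten listed test accuracies. The snippet below is only an arithmetic check; the variable names are hypothetical.

```python
# Arithmetic check of Table 4: mean and maximum of the ten best runs.
from statistics import mean

best_runs = [85.51, 89.85, 94.20, 89.85, 92.75, 88.41, 97.10, 86.96, 92.75, 86.96]
average_accuracy = round(mean(best_runs), 2)
highest_accuracy = max(best_runs)
print(average_accuracy, highest_accuracy)
```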

Table 5

Comparison of the proposed work with other state-of-the-art work

Researcher	Accuracy
Shan et al. [23]	76.74%
Xie et al. [22]	86.71%
Lopes et al. [21]	86.74%
Chen et al. [20]	87.73%
Cai et al. [19]	95.24%
Ucar [18]	96.10%
Proposed	(Highest) 97.10%
Nwosu et al. [17]	97.60%

As discussed earlier, we performed data augmentation on the training images of the JAFFE dataset and experimented with the model by varying different hyper-parameters. After experimenting with various combinations, we obtained 97.10% accuracy in the best case. Table 5 compares the proposed method with state-of-the-art methods evaluated on the JAFFE dataset. All the compared works are evaluated on the JAFFE dataset on the 7 expressions mentioned above: Anger, Fear, Happiness, Disgust, Surprise, Sadness, and Neutral. Moreover, all the works are based on CNNs.

Shan et al. [23] proposed a CNN model which consists of only two convolutional layers. As recognition of facial expressions is a moderately complex task, a CNN requires more layers in order to achieve better results. Similarly, a CNN requires more neurons in the fully connected layer to accommodate all the connections coming from the previous layer. Xie et al. [22] proposed a model which consists of one fully connected layer with only 64 neurons. Lopes et al. [21] used a large convolutional kernel size despite having a small input image. Chen et al. [20] proposed a CNN architecture with a total of 4 convolutional layers and 2 pooling layers, in which 2 consecutive convolutional layers are followed by 1 pooling layer. All convolutional layers have a high number of convolution filters (i.e., 96, 128, 256, 256). Cai et al. [19] and Ucar [18] obtained accuracies of 95.24% and 96.10%, respectively, but both of their proposed CNN architectures are computationally expensive. Similarly, Nwosu et al. [17] proposed a computationally expensive approach, which consists of 2 individual CNNs to extract features of the mouth and eyes.

Our proposed model is a balanced architecture with carefully chosen convolutional layers, convolution filters and kernel sizes, pooling kernel size, number of fully connected layers, and neurons. For the problem of face expression recognition, our proposed concise model has a simpler architecture than the other state-of-the-art CNN models compared here.

Further research is possible in the following directions: image quality and side face images. The quality of a face image can affect the performance of face expression recognition. Regarding image quality, two issues can be explored: (1) image resolution and (2) presence of noise. The image resolution can be explored experimentally by changing the dimensions of the first layer of the CNN. If some noise is present in the images, which is a possible scenario, the images can be passed through an image processing pipeline. The dataset of our study, i.e., the JAFFE dataset, did not require any noise removal technique. However, noisy images can be prepared by adding noise that follows some distribution, and the images can then be filtered before they are fed to the CNN.
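One possible way to set up the noise experiment suggested above is sketched below. The choice of Gaussian noise, the standard deviation of 10 gray levels, and the 3 $\times$ 3 mean filter are all assumptions made for illustration; the text leaves the noise distribution and the filter open, and the function names are hypothetical.

```python
# Sketch only: add Gaussian noise to a grayscale image and smooth it with a
# 3x3 mean filter. The noise model and the filter are illustrative choices.
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to the range 0 to 255."""
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def mean_filter(image):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    height, width = len(image), len(image[0])
    filtered = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            row.append(sum(window) / len(window))
        filtered.append(row)
    return filtered


rng = random.Random(0)
clean = [[128.0] * 8 for _ in range(8)]  # flat gray test image
noisy = add_gaussian_noise(clean, sigma=10.0, rng=rng)
denoised = mean_filter(noisy)
print(noisy[0][:3], denoised[0][:3])
```

In an actual study, the noisy and filtered images would replace the clean test images so that the drop and recovery in classification accuracy can be measured.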

In this article, we studied face expression recognition from frontal face images. However, side images may provide useful features that are not apparent from frontal images of faces, and such information can be used as additional information. If we consider only side faces, we may lose essential information on face expression. Therefore, a face expression recognition model can be studied that combines the classifications of multiple independent models working on separate images, i.e., frontal face images and side images. This requires a dataset that contains images of face expressions captured by multiple cameras from different angles. Combining multiple models can be explored using ensemble learning techniques.

5. Conclusion and future work

In this article, we have proposed a CNN architecture for facial expression recognition. We have given an overview of the CNN architecture, which can help beginners understand CNNs. The major challenge was to find an appropriate CNN architecture which is cost-effective and which performs comparably to state-of-the-art CNN models proposed by researchers. We addressed this challenge by performing data augmentation on the training data and experimenting with various combinations of CNN architectures. Another challenge was to classify test images with high accuracy. Therefore, the model we created had to be well structured with optimized hyper-parameters.

We used the JAFFE dataset to perform various experiments related to our proposed work. We compared our proposed model with other state-of-the-art models, and the results show that our proposed model is concise compared to the other models. We obtained a highest accuracy of 97.10% and an average accuracy of 90.43% over the best 10 runs of our defined model. We have also reported the behavior of the CNN model under different hyper-parameters. Visualizations of intermediate CNN layers are also included.

Future experiments can aim to obtain the same result with a smaller image size, which would reduce training time. Basic pre-processing techniques like cropping can be applied. Moreover, enlarging the test set by including augmented data, cross-database evaluation, and transfer learning can also be future work. Further research directions are also provided, which include the effect of image quality on the recognition model and the use of side images for expression recognition.

References

1. Forgas, Bower. Mood effects on person-perception judgments. Journal of Personality and Social Psychology. 1987; 53(1): 53.

2. Berns, Hirth. Control of facial expressions of the humanoid robot head Roman. in: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006; 3119-3124.

3. Bartlett, Littlewort, Fasel, Movellan. Real time face detection and facial expression recognition: Development and applications to human computer interaction. in: 2003 Conference on Computer Vision and Pattern Recognition Workshop. 2003; 5: 53-53.

4. Shan, Gong, McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing. 2009; 27(6): 803-816.

5. Turk, Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience. 1991; 3(1): 71-86.

6. Fogel, Sagi. Gabor filters as texture discriminator. Biological Cybernetics. 1989; 61(2): 103-113.

7. Brahmbhatt, Prajapati, Dabhi. Survey and analysis of extraction of human face features. in: 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). 2017; 1-8.

8. Zavarez, Berriel, Oliveira-Santos. Cross-database facial expression recognition based on fine-tuned deep convolutional network. in: 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). 2017; 405-412.

9. Hinton, Osindero, Teh. A fast learning algorithm for deep belief nets. Neural Computation. 2006; 18(7): 1527-1554.

10. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012; 25: 1097-105.

11. Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. 2014; 15(1): 1929-1958.

12. Ioanna-Ourania, Tsihrintzis. Visual affect recognition. Frontiers in Artificial Intelligence and Applications. 2010.

13. Stathopoulou, Tsihrintzis. Appearance-based face detection with artificial neural networks. Intelligent Decision Technologies. 2011; 5(2): 101-111.

14. Prajapati, Brahmbhatt, Dabhi. Extraction of human face features from color images. Intelligent Decision Technologies. 2019; 13(1): 67-80.

15. Stathopoulou, Alepis, Tsihrintzis, Virvou. On assisting a visual-facial affect recognition system with keyboard-stroke pattern information. Knowledge-Based Systems. 2010; 23(4): 350-356.

16. Virvou, Tsihrintzis, Alepis, Stathopoulou, Kabassi. Emotion recognition: Empirical studies towards the combination of audio-lingual and visual-facial modalities through multi-attribute decision making. International Journal on Artificial Intelligence Tools. 2012; 21(2): 1240001.

17. Nwosu, Wang, Unwala, Yang, Zhang. Deep convolutional neural network for facial expression recognition using facial parts. in: 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). 2017; 1318-1321.

18. Uçar. Deep Convolutional Neural Networks for facial expression recognition. in: 2017 IEEE International Conference on Innovations in Intelligent SysTems and Applications (INISTA). 2017; 371-375.

19. Cai, Chang, Tang, Xue, Wei. Facial expression recognition method based on sparse batch normalization CNN. in: 2018 37th Chinese Control Conference (CCC). 2018; 9608-9613.

20. Chen, Yang, Wang, Zou. Convolution neural network for automatic facial expression recognition. in: 2017 International Conference on Applied System Innovation (ICASI). 2017; 814-817.

21. Lopes, de Aguiar, De Souza, Oliveira-Santos. Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order. Pattern Recognition. 2017; 61: 610-628.

22. Xie, Wang, Cai, Rao, Liu. Convolutional neural networks for facial expression recognition with few training samples. in: 2018 37th Chinese Control Conference (CCC). 2018; 9540-9544.

23. Shan, Guo, You, Bie. Automatic facial expression recognition based on a deep convolutional-neural-network structure. in: 2017 IEEE 15th International Conference on Software Engineering Research, Management and Applications (SERA). 2017; 123-128.

24. Lyons, Akamatsu, Kamachi, Gyoba. Coding facial expressions with gabor wavelets. in: Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition. 1998; 200-205.

25. Lucey, Cohn, Kanade, Saragih, Ambadar, Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-workshops. 2010; 94-101.

26. Goodfellow, Erhan, Carrier, Courville, Mirza, Hamner, Cukierski, Tang, Thaler, Lee, Zhou. Challenges in representation learning: A report on three machine learning contests. in: International Conference on Neural Information Processing. 2013; 117-124. Springer, Berlin, Heidelberg.

27. Zhao, Huang, Taini, Pietikäinen. Facial expression recognition from near-infrared videos. Image and Vision Computing. 2011; 29(9): 607-619.

28. Vyas, Prajapati, Dabhi. Survey on face expression recognition using CNN. in: 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS). 2019; 102-106.

29. Shaver, Schwartz, Kirson, O’Connor. Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology. 1987; 52(6): 1061.

30. Deng. Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing. 2020.

31. Liu, Zhang, Pan. Facial expression recognition with CNN ensemble. in: 2016 International Conference on Cyberworlds (CW). 2016; 163-166.

32. Zhao, Liang, Liu, Han, Vasconcelos, Yan. Peak-piloted deep network for facial expression recognition. in: European Conference on Computer Vision. 2016; 425-442. Springer, Cham.

33. Kingma. Adam: A method for stochastic optimization. in: 3rd International Conference for Learning Representations. 2015.

34. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv: 1609.04747. 2016.

35. Kiros, Cho, Courville, Salakhudinov, Zemel, Bengio. Show, attend and tell: Neural image caption generation with visual attention. in: International Conference on Machine Learning. 2015; 2048-2057. PMLR.

36. Radford, Metz, Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv: 1511.06434. 2015.

37. Kipf, Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv: 1609.02907. 2016.

38. Zhang. Image based static facial expression recognition with multiple deep network learning. in: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. 2015; 435-442.
