Real-time facial expression recognition using smoothed deep neural network ensemble

Abstract

Facial emotion recognition (FER) has been extensively researched over the past two decades due to its direct impact on the computer vision and affective robotics fields. However, the datasets available to train these models often include mislabelled data due to labeller bias, which drives the model to learn incorrect features. In this paper, a facial emotion recognition system is proposed that addresses automatic face detection and facial expression recognition separately. The latter is performed by an ensemble of only four deep convolutional neural networks, while a label smoothing technique is applied to deal with the mislabelled training data. The proposed system takes only 13.48 ms using a dedicated graphics processing unit (GPU) and 141.97 ms using a CPU to recognize facial emotions, and reaches current state-of-the-art performance on the challenging FER2013, SFEW 2.0, and ExpW databases, with recognition accuracies of 72.72%, 51.97%, and 71.82%, respectively.

Keywords

Computer vision, emotion recognition, facial expression, human-machine interaction, label smoothing

1. Introduction

The evolution of robotics and artificial intelligence has taken a qualitative leap in the last decade in domains such as dynamic stability [1], autonomous navigation [2], and speech recognition [3], among others. This has allowed social robots, which must be designed around human-oriented perception, to behave and adapt to complex scenarios in more advanced ways [4], including emotional feedback during human-robot interaction (HRI) [5]. Emotion estimation is key to such applications, and improving its accuracy can enhance the living conditions of several social groups, such as elders with mental diseases [6], autistic children [7], and hospitalized children and their families [8], and may open the way to other groups.

Emotion estimation is a complex task [9], and despite the variability of emotional responses and their dependence on culture [10], there is a need to narrow down the reference system for experimental research by creating an emotional model. Based on psychological studies, emotions are generally divided into six prototypical categories [11]: anger, disgust, fear, happiness, sadness, and surprise, with the neutral emotion defined as a baseline. Emotion recognition systems are developed based on several modalities, such as facial expressions [12, 13, 14], body gestures [15], speech [16], and physiological signals [17, 18].

Facial emotion recognition (FER) systems can be designed using two different strategies: image-based and video-based. Although video-based systems carry a large amount of emotional information due to their dynamic properties, they require a more complex analysis that is still difficult to implement in real-time systems. For this reason, the methodology was first tested using a simplified approach on static images; however, it is worth mentioning that FER in these conditions is still a challenging task due to multiple environmental variations such as brightness, occlusions, or body posture, which can affect the information available in the processed image.

Overall, image-based FER systems consist of three major stages: face detection and pre-processing, feature extraction, and classification. The first stage aims to detect the region of interest (the face) in order to remove information that is unnecessary for recognizing the facial expression, such as the background; it becomes challenging when acquisition conditions, mainly the lighting, are poor. The feature extraction stage represents the pre-processed image as a characteristic vector containing the relevant features. In FER, feature extraction techniques can be divided into two main categories: appearance-based and geometric-based methods. Appearance-based methods aim to detect textural information from face images using well-known feature extractors such as Gabor filters [19] and local binary patterns (LBP) [20]. Strategies based on geometric properties generally use the detected facial landmarks to encode geometric information such as angles and distances. The extracted features are then used for classification to recognize facial emotions.

In the last few years, the use of deep learning in several applications [21, 22] has become possible, since powerful graphics processing units (GPUs) are now available. Compared to traditional methods based on the aforementioned feature extraction techniques, techniques based on convolutional neural networks (CNNs) have achieved state-of-the-art FER recognition rates [14]. However, due to the stochastic properties of the training process, a different set of weights is found for each trained neural network. This suggests training several models and combining their predictions as an ensemble, which may reduce the generalization error in most cases, since each single CNN extracts discriminative features for a specific task, as shown in our published paper [23].

However, it is worth mentioning that training FER models faces two major challenges. First, environmental variations such as brightness, occlusions, or body posture can affect the information available to recognize a facial expression. Secondly, most of the databases designed for training are pre-labelled by humans and therefore contain inherent bias due to the complexity of emotion perception, as shown in Fig. 1. Thus, training a model with a wrong target class may lead it to learn incorrect features and subsequently decrease its accuracy and generalization capabilities.

Figure 1.

Mislabelled data samples from the FER 2013 training set: neutral emotions were labelled as happy, afraid, angry, and sad, respectively.

This paper extends the CNN ensembling method for facial emotion recognition proposed in our previous work [23] by including the label smoothing technique to deal with the mislabelled data. Computation times are additionally reported for three different hardware setups, with two possible face detection systems depending on the image acquisition complexity with respect to environmental variation. The methodology was tested on three challenging databases collected in real-world conditions and in an unconstrained manner: FER 2013 [24], SFEW 2.0 [25], and ExpW [26].

The rest of the paper is organized as follows: Section 2 presents the related works, the stages of the proposed method are described in detail in Section 3, and the experimental results and discussions follow in Section 4. Conclusions and perspectives are drawn in Section 5.

2. Related works

In the facial emotion recognition field, several research works have been carried out to improve the emotion estimation accuracy. The use of deep learning has outperformed the conventional approaches that use the traditional feature extraction techniques; their training methodology can generally be divided into two main categories: single network learning and ensemble network learning.

The single network learning category uses only one architecture for the feature extraction and the recognition task. For example, Mollahosseini et al. [27] designed a CNN architecture of two convolutional layers and four inception layers for FER. The authors of [28] used a deep neural network with relativity learning (DNNRL), made of three convolutional layers followed by three inception layers and trained with the triplet loss. A multi-task convolutional network for simultaneous facial landmark detection and facial expression recognition is proposed in [29]. Tang et al. [30] proposed the DLSVM, which replaces the softmax layer with a linear support vector machine (SVM) for FER applications. Lian et al. [31] studied the contribution of each face region to facial expression recognition using the class activation mapping (CAM) technique and a visualization model based on the DenseNet-BC architecture.

The ensemble network learning methods adopt the ‘divide and conquer’ approach; the input image is fed through a set of deep architectures and the final decision is obtained by the combination of each network score output. Overall, the ensemble network learning methods achieve a better performance compared to the single learning category, since each network extracts different relevant features and may lead to minimizing the generalization error; however, this methodology requires additional computing time, given that for the same task, multiple networks are used compared to the first approach.

Levi et al. [32] proposed a novel method based on a deep ensemble network. First, by applying different radius parameters, LBP codes are extracted and mapped to a 3D space using multidimensional scaling (MDS); afterwards, the original RGB images and the mapped codes are used to train a set of 20 CNN models, and the predicted emotion is then obtained by the weighted average of each model’s output. Kim et al. [33] proposed to fuse information about non-aligned and aligned faces for FER. They introduced an alignment mapping network (AMN) to estimate aligned states of non-alignable faces; then, with a set of 9 deep convolutional neural networks (DCNs), the average or the majority voting rule is used to compute the emotion prediction. Similarly, Pramerdorfer et al. [34] trained an ensemble of 8 CNNs, achieving the best state-of-the-art performance on the FER 2013 database [24].

Most of the previous studies do not take into account the computation time of the FER system and its integration capability in real devices; furthermore, the mislabelled training data that drive the model to learn incorrect features are not studied. Similar to [32, 33, 34], we proposed in [23] a deep network ensemble including only three CNNs, with comparable accuracy.

In this paper, we propose two possible implementations, with and without a dedicated GPU, and report computation times on three hardware configurations to evaluate the proposed model under real-time constraints and its integration capability for general use. With regard to the mislabelled data, we propose to use the label smoothing technique [35], which reduces the confidence given to image labels during the training process.

Figure 2.

Schematic overview of the proposed FER system, Stage 1 includes the face detection procedure and Stage 2 performs the facial emotion recognition.

3. Material and methods

The proposed framework is based on the following stages: face detection and pre-processing, feature extraction and classification using deep neural networks, and label smoothing optimization followed by an ensemble learning approach. Figure 2 shows a schematic overview of our proposed FER system.

3.1 Face detection and pre-processing

The face detection stage aims to reduce the amount of information given to the CNNs designed to recognize facial emotions. Face detection is a highly investigated area, the best-known algorithm being the Viola-Jones method [36], based on Haar feature-based cascade classifiers.

In the deep learning era, several object detection techniques have been proposed in the literature, which can be categorized into two main approaches: region proposal based networks and one-shot based networks. The region proposal based methodology, including RCNN [37] and its improved versions, Fast RCNN [38] and Faster RCNN [39], consists of two stages: first, the regions of the input image that may contain objects are selected; they are then fed into a CNN for feature extraction and classification to identify the object and the coordinates of its bounding box. The one-shot based methodology, including SSD [40] and YOLO [41], needs only a single stage for detection and classification, contrary to the first approach.

Overall, the one-shot based methods are faster than the region proposal based methods, which are computationally expensive since they use two stages for object detection [42]. You Only Look Once (YOLO v2) [43] is both fast and accurate: it achieved 76.8 mAP at 67 FPS on the standard VOC 2007 database, outperforming Faster RCNN and SSD, which makes it suitable for real-time applications. It divides the input image into an $S\times S$ grid, in which each cell predicts $B$ bounding boxes with their confidence scores over $C$ classes.

We have chosen to deploy two face detection methodologies in our FER system: the Viola-Jones technique as a non-deep-learning method, and the YOLO v2 model as a deep learning method, extended to face detection by training the network on the fully annotated face database WIDER DB [44]. The choice between them depends on the end user’s hardware configuration, since deep learning requires a GPU to handle the large amount of computation.

Detected faces are cropped, converted from RGB to grayscale, and resized to a resolution of 48 $\times$ 48 px. Finally, pixel values are normalized to the [0, 1] range.
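The pre-processing chain above can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not say which grayscale weights or interpolation were used, so the luma weights and the nearest-neighbour resize below are our own choices, and `face` is a hypothetical synthetic crop.

```python
# Minimal pre-processing sketch in pure Python (no image library assumed).
# `face` is a hypothetical cropped RGB face: rows of (R, G, B) tuples in 0..255.

def to_grayscale(rgb_rows):
    """Convert RGB pixels to grey levels with BT.601 luma weights (an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_rows]

def resize_nearest(gray_rows, size=48):
    """Resize to size x size by nearest-neighbour sampling (an assumption)."""
    h, w = len(gray_rows), len(gray_rows[0])
    return [[gray_rows[(i * h) // size][(j * w) // size] for j in range(size)]
            for i in range(size)]

def normalize(gray_rows):
    """Scale 8-bit grey levels into [0, 1], clamping against rounding error."""
    return [[min(1.0, max(0.0, v / 255.0)) for v in row] for row in gray_rows]

def preprocess(face, size=48):
    """Grayscale, resize, then normalize, in the order given in the text."""
    return normalize(resize_nearest(to_grayscale(face), size))

# Usage: a synthetic 100 x 80 crop with a horizontal intensity ramp.
face = [[(3 * x, 3 * x, 3 * x) for x in range(80)] for _ in range(100)]
patch = preprocess(face)
print(len(patch), len(patch[0]))
```

The result is a 48 $\times$ 48 grid of values in [0, 1], which is the input format expected by the networks described in Section 3.2.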

3.2 Deep feature extraction and classification

Deep convolutional neural networks have achieved the best state-of-the-art performances in most computer vision and AI applications, thanks to their capability to learn highly abstract representations, in comparison to classical feature extraction methods. However, they require large amounts of data to generalize properly, and substantial computing capacity due to their large number of trainable parameters.

Overall, the first convolution layers extract low-level features such as shapes and edges, while the deeper layers learn more complex features. Therefore, our starting point consists of defining a baseline architecture with as few parameters as possible to limit the computational cost; from this baseline, we then derive a set of deeper architectures to explore the complexity of facial emotions.

Table 1
Description of our baseline structure (Model A), $w=$ width, $h=$ height, $f=$ filters, $s=$ stride, $d=$ depth

Layer	Kernel, stride	Input size	Output size
	$w\times h\times f,s$	$w\times h\times d$	$w\times h\times d$
Input	/	48 $\times$ 48 $\times$ 1	/
Conv A-1	3 $\times$ 3 $\times$ 96, 1	48 $\times$ 48 $\times$ 1	48 $\times$ 48 $\times$ 96
Conv A-2	3 $\times$ 3 $\times$ 96, 1	48 $\times$ 48 $\times$ 96	48 $\times$ 48 $\times$ 96
Conv A-3	3 $\times$ 3 $\times$ 96, 1	48 $\times$ 48 $\times$ 96	48 $\times$ 48 $\times$ 96
Batch Norm 1	/	48 $\times$ 48 $\times$ 96	48 $\times$ 48 $\times$ 96
MaxPool 1	3 $\times$ 3, 2	48 $\times$ 48 $\times$ 96	23 $\times$ 23 $\times$ 96
Dropout 1	/	23 $\times$ 23 $\times$ 96	23 $\times$ 23 $\times$ 96
Conv A-4	3 $\times$ 3 $\times$ 192, 1	23 $\times$ 23 $\times$ 96	23 $\times$ 23 $\times$ 192
Conv A-5	3 $\times$ 3 $\times$ 192, 1	23 $\times$ 23 $\times$ 192	23 $\times$ 23 $\times$ 192
Conv A-6	3 $\times$ 3 $\times$ 192, 1	23 $\times$ 23 $\times$ 192	23 $\times$ 23 $\times$ 192
Batch Norm 2	/	23 $\times$ 23 $\times$ 192	23 $\times$ 23 $\times$ 192
MaxPool 2	3 $\times$ 3, 2	23 $\times$ 23 $\times$ 192	11 $\times$ 11 $\times$ 192
Dropout 2	/	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 192
Conv A-7	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 192
Conv A-8	1 $\times$ 1 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 192
Conv A-9	1 $\times$ 1 $\times$ 7, 1	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 7
Batch Norm 3	/	11 $\times$ 11 $\times$ 7	11 $\times$ 11 $\times$ 7
GAP 1	/	11 $\times$ 11 $\times$ 7	1 $\times$ 1 $\times$ 7
Softmax 1	/	1 $\times$ 1 $\times$ 7	1 $\times$ 1 $\times$ 7

Table 2

Description of the Model B last layers by extending the baseline model from the Conv A-7 layer, $w=$ width, $h=$ height, $f=$ filters, $s=$ stride, $d=$ depth

Layer	Kernel, stride	Input size	Output size
	$w\times h\times f,s$	$w\times h\times d$	$w\times h\times d$
Conv A-7	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 192
Conv B-8	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	9 $\times$ 9 $\times$ 192
MaxPool 3	3 $\times$ 3, 2	9 $\times$ 9 $\times$ 192	4 $\times$ 4 $\times$ 192
Dropout 3	/	4 $\times$ 4 $\times$ 192	4 $\times$ 4 $\times$ 192
Conv B-9	1 $\times$ 1 $\times$ 256, 1	4 $\times$ 4 $\times$ 192	4 $\times$ 4 $\times$ 256
Conv B-10	1 $\times$ 1 $\times$ 7, 1	4 $\times$ 4 $\times$ 256	4 $\times$ 4 $\times$ 7
Batch Norm 4	/	4 $\times$ 4 $\times$ 7	4 $\times$ 4 $\times$ 7
GAP 2	/	4 $\times$ 4 $\times$ 7	1 $\times$ 1 $\times$ 7
Softmax 2	/	1 $\times$ 1 $\times$ 7	1 $\times$ 1 $\times$ 7

Table 3

Description of the Model C last layers by extending the baseline model from the Conv A-7 layer, $w=$ width, $h=$ height, $f=$ filters, $s=$ stride, $d=$ depth

Layer	Kernel, stride	Input size	Output size
	$w\times h\times f,s$	$w\times h\times d$	$w\times h\times d$
Conv A-7	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 192
Conv C-8	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	9 $\times$ 9 $\times$ 192
MaxPool 4	3 $\times$ 3, 2	9 $\times$ 9 $\times$ 192	4 $\times$ 4 $\times$ 192
Dropout 4	/	4 $\times$ 4 $\times$ 192	4 $\times$ 4 $\times$ 192
Conv C-9	3 $\times$ 3 $\times$ 256, 1	4 $\times$ 4 $\times$ 192	4 $\times$ 4 $\times$ 256
Conv C-10	1 $\times$ 1 $\times$ 256, 1	4 $\times$ 4 $\times$ 256	4 $\times$ 4 $\times$ 256
Conv C-11	1 $\times$ 1 $\times$ 7, 1	4 $\times$ 4 $\times$ 256	4 $\times$ 4 $\times$ 7
Batch Norm 5	/	4 $\times$ 4 $\times$ 7	4 $\times$ 4 $\times$ 7
GAP 3	/	4 $\times$ 4 $\times$ 7	1 $\times$ 1 $\times$ 7
Softmax 3	/	1 $\times$ 1 $\times$ 7	1 $\times$ 1 $\times$ 7

3.2.1 Baseline model architecture

First, a series of well-known deep learning architectures were tested, and Conv-Pool-CNN-C from [45] was finally selected; it is called Model A hereafter and defined as our baseline. This choice was based on its small architecture, composed of nine convolutional layers, mostly with a 3 $\times$ 3 kernel (see Table 1), and max-pooling layers. The last fully connected layer was replaced with a 1 $\times$ 1 convolutional layer followed by a global average pooling (GAP) layer, which reduces the number of parameters by scaling down the spatial dimensionality and acts as a regularization layer that prevents overfitting [46], giving a total of only 1.37 million parameters. Besides the GAP layer, we added further regularization techniques to prevent overfitting: dropout and batch normalization layers. Regarding the input and output layers, the input was adapted to the 48 $\times$ 48 px images produced by the pre-processing stage, while the softmax activation function is used in the last layer to obtain facial expression probabilities. The structure of Model A is detailed in Table 1.
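The 1.37 million figure and the spatial sizes of Table 1 can be checked with a short calculation. The sketch below is an illustration, not the authors' code; it assumes that convolutions preserve the spatial size, that pooling is unpadded, that every convolution has a bias, and that each batch normalization layer adds two trainable values per channel.

```python
# Shape and parameter trace for the baseline (Model A), following Table 1.

def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

def pool_out(size, window=3, stride=2):
    """Output width of an unpadded pooling layer."""
    return (size - window) // stride + 1

# (kernel size, output channels) for Conv A-1 .. Conv A-9
convs = [(3, 96), (3, 96), (3, 96),
         (3, 192), (3, 192), (3, 192),
         (3, 192), (1, 192), (1, 7)]
# Channel counts seen by Batch Norm 1, 2 and 3
batch_norm_channels = [96, 192, 7]

channels, total = 1, 0              # the grayscale input has one channel
for kernel, c_out in convs:
    total += conv_params(kernel, channels, c_out)
    channels = c_out
total += sum(2 * c for c in batch_norm_channels)

side = 48
side = pool_out(side)               # MaxPool 1: 48 -> 23
side = pool_out(side)               # MaxPool 2: 23 -> 11

print(side, total)
```

Under these assumptions the count comes to 1,368,021 parameters, in line with the 1.37 million reported, and the feature map entering the GAP layer is 11 $\times$ 11, as listed in Table 1.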

3.2.2 Derived model architectures

By adding maxpooling, dropout, and a set of convolutional layers, after the baseline’s Conv A-7 layer (see Table 1), three architectures have been derived, hereafter called: Model B, Model C, and Model D, respectively.

In Model B, we changed the kernel size of the Conv A-8 layer to 3 $\times$ 3 and added only one convolutional layer of 256 filters with a 1 $\times$ 1 kernel, giving an architecture of 1.71 million parameters. In Model C, we extended the latter with 256 filters with a 3 $\times$ 3 kernel, increasing the network to 2.17 million learnable parameters. In the last model, Model D, the 256 filters with a 1 $\times$ 1 kernel were replaced with a 3 $\times$ 3 kernel, and a dropout layer and 512 filters with a 1 $\times$ 1 kernel were added, giving an architecture of 2.83 million parameters. The extension of each model from the baseline A is described in Tables 2–4.
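As a cross-check of the parameter totals quoted for the four models, the convolutional layers listed in the kernel columns of Tables 1–4 can be summed directly. The sketch assumes a bias per filter and ignores the few hundred batch normalization parameters; the helper names are ours.

```python
# Convolutional parameter counts for Models A-D from the kernel columns
# of Tables 1-4 (bias per filter assumed, batch normalization ignored).

def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out + c_out

def count(convs, c_in=1):
    """Sum the parameters of a chain of (kernel, filters) convolutions."""
    total = 0
    for kernel, c_out in convs:
        total += conv_params(kernel, c_in, c_out)
        c_in = c_out
    return total

# Shared trunk: Conv A-1 .. Conv A-7
trunk = [(3, 96)] * 3 + [(3, 192)] * 4
heads = {
    "A": [(1, 192), (1, 7)],
    "B": [(3, 192), (1, 256), (1, 7)],
    "C": [(3, 192), (3, 256), (1, 256), (1, 7)],
    "D": [(3, 192), (3, 256), (3, 256), (1, 512), (1, 7)],
}
conv_totals = {name: count(trunk + head) for name, head in heads.items()}
millions = {name: round(n / 1e6, 2) for name, n in conv_totals.items()}
print(millions)
```

The rounded totals are 1.37, 1.71, 2.17 and 2.83 million, matching the values stated for Models A, B, C and D.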

Table 4
Description of the Model D last layers by extending the baseline model from the Conv A-7 layer, $w=$ width, $h=$ height, $f=$ filters, $s=$ stride, $d=$ depth

Layer	Kernel, stride	Input size	Output size
	$w\times h\times f,s$	$w\times h\times d$	$w\times h\times d$
Conv A-7	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	11 $\times$ 11 $\times$ 192
Conv D-8	3 $\times$ 3 $\times$ 192, 1	11 $\times$ 11 $\times$ 192	9 $\times$ 9 $\times$ 192
MaxPool 5	3 $\times$ 3, 2	9 $\times$ 9 $\times$ 192	4 $\times$ 4 $\times$ 192
Dropout 5	/	4 $\times$ 4 $\times$ 192	4 $\times$ 4 $\times$ 192
Conv D-9	3 $\times$ 3 $\times$ 256, 1	4 $\times$ 4 $\times$ 192	4 $\times$ 4 $\times$ 256
Conv D-10	3 $\times$ 3 $\times$ 256, 1	4 $\times$ 4 $\times$ 256	4 $\times$ 4 $\times$ 256
Dropout 6	/	4 $\times$ 4 $\times$ 256	4 $\times$ 4 $\times$ 256
Conv D-11	1 $\times$ 1 $\times$ 512, 1	4 $\times$ 4 $\times$ 256	4 $\times$ 4 $\times$ 512
Conv D-12	1 $\times$ 1 $\times$ 7, 1	4 $\times$ 4 $\times$ 512	4 $\times$ 4 $\times$ 7
Batch Norm 5	/	4 $\times$ 4 $\times$ 7	4 $\times$ 4 $\times$ 7
GAP 4	/	4 $\times$ 4 $\times$ 7	1 $\times$ 1 $\times$ 7
Softmax 4	/	1 $\times$ 1 $\times$ 7	1 $\times$ 1 $\times$ 7

3.3 Label smoothing optimization

One of the main problems in facial emotion recognition (FER) is that the training databases contain mislabelled images (see Fig. 1); thus, the model is prone to learn incorrect features from the source data. With this in mind, a strategy based on label smoothing [35] can be adopted.

In the learning process, each label is assumed to be correct and categorical. The model is expected to learn from the input data to predict the given label, which enters the loss function and therefore drives the gradients that reduce the expected error.

An intuitive example is given by the cross-entropy loss function for classification over $K$ classes: $L=-\sum_{y=1}^{K}p(y|x)\log(q(y|x))$, where $L$ is the computed loss, $p(y|x)$ the true label probability distribution, and $q(y|x)$ the predicted probability that the image $x$ belongs to the label $y$. During training, the loss function drives the computed gradients toward a minimum. When the given label is accurately predicted, the error is small. On the contrary, for mislabelled data the computed error becomes large, with dramatic consequences for the training process.
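A small numerical illustration of this effect, using the FER 2013 class order (3 happy, 6 neutral). The softmax output `q` is a hypothetical prediction chosen for the example, not a value from the experiments.

```python
import math

def cross_entropy(p, q):
    """L = -sum_y p(y|x) * log q(y|x); terms with p = 0 contribute nothing."""
    return -sum(p_y * math.log(q_y) for p_y, q_y in zip(p, q) if p_y > 0)

# Hypothetical softmax output: the network is fairly sure of class 6 (neutral).
q = [0.02, 0.02, 0.02, 0.01, 0.02, 0.02, 0.89]

neutral = [0, 0, 0, 0, 0, 0, 1]   # label agrees with the prediction
happy = [0, 0, 0, 1, 0, 0, 0]     # the same image annotated as "happy"

loss_agree = cross_entropy(neutral, q)
loss_mislabelled = cross_entropy(happy, q)
print(round(loss_agree, 3), round(loss_mislabelled, 3))
```

With the one-hot label the loss reduces to $-\log q$ of the labelled class, so the mislabelled sample costs $-\log(0.01)$ against $-\log(0.89)$ for the consistent one, roughly a forty-fold difference that dominates the gradient for that sample.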

Labelling data with a high confidence of $p(y|x)=1$ for training pushes the model to over-fit the data, giving a high variance and, consequently, poor generalization on new data presented to the final model. This problem can be mitigated by reducing the confidence factor through the introduction of noise into the ground truth label distribution $p(y|x)$, giving $p^{\prime}(y|x)=(1-\epsilon)p(y|x)+\frac{\epsilon}{K}$, where $\epsilon$ is the smoothing factor. The cross-entropy loss function then becomes:

$\displaystyle L^{\prime}=-\sum_{y=1}^{K}\left[(1-\epsilon)p(y|x)+\frac{\epsilon}{K}\right]\log(q(y|x))$ (1)

This rests on the assumption that the databases contain incorrectly labelled data, owing to human bias, since the labelling is carried out by multiple participants who may apply different criteria. The solution, therefore, lowers the confidence given to image labels. This decreases the loss for mislabelled data and prevents such samples from driving the gradients far from the objective. Label smoothing has been applied with a smoothing factor $\epsilon=$ 0.2. For example, a happy emotion encoded as [0, 0, 0, 1, 0, 0, 0] becomes approximately [0.03, 0.03, 0.03, 0.83, 0.03, 0.03, 0.03], since $\epsilon/K=0.2/7\approx 0.03$ and $(1-\epsilon)+\epsilon/K\approx 0.83$.
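The smoothed target and the loss of Eq. (1) can be reproduced with a few lines. The prediction `q` is again a hypothetical softmax output for a neutral face annotated as happy; only $\epsilon=0.2$ and $K=7$ come from the text.

```python
import math

def smooth(one_hot, eps=0.2):
    """p'(y|x) = (1 - eps) * p(y|x) + eps / K."""
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]

def cross_entropy(p, q):
    """Cross-entropy between a target distribution p and a prediction q."""
    return -sum(p_y * math.log(q_y) for p_y, q_y in zip(p, q) if p_y > 0)

happy = [0, 0, 0, 1, 0, 0, 0]
target = smooth(happy)
print([round(t, 4) for t in target])

# Hypothetical prediction on a neutral face that was annotated "happy".
q = [0.02, 0.02, 0.02, 0.01, 0.02, 0.02, 0.89]
hard_loss = cross_entropy(happy, q)     # one-hot target
soft_loss = cross_entropy(target, q)    # smoothed target, Eq. (1)
print(hard_loss > soft_loss)
```

The exact entries are $0.2/7\approx 0.0286$ for the six other classes and $0.8+0.2/7\approx 0.8286$ for the labelled class, and they still sum to one. For this mislabelled sample the smoothed loss is lower than the one-hot loss, although the reduction is moderate; the main effect is that no single label is trusted with full confidence.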

3.4 Ensemble learning and fine-tuning

Several factors are involved in building an ensemble. First, the way the training data is fed to the models: one strategy uses k-fold cross-validation, where $k$ different models are trained on $k$ different subsets of the training data. Another strategy samples the training dataset with replacement, producing different subsets of data with duplicated examples; this is called bootstrap aggregation. Both approaches rely on the models converging to different local minima, leading to weakly correlated errors, so that the averaged prediction may improve on the individual estimates. Secondly, the initialization of the model weights, different hyper-parameters, or the structural properties (number of layers or nodes), as in our proposed CNN architectures A, B, C, and D, may lead to a set of models covering the nuances of the solution space, thereby improving performance. Finally, the predictions can be combined by averaging or by taking the maximum estimate, as described in Eqs (2) and (3), respectively, for the softmax function. The ensemble’s probability distribution $q(y_{i}|x_{i})_{\textit{ens}}$, built from $M$ single models, is computed as:

$\displaystyle q(y_{i}|x_{i})_{\textit{ens}}=\frac{1}{M}\sum_{m=1}^{M}\frac{\exp(z^{(m)}_{y_{i}})}{\sum_{j=1}^{K}\exp(z^{(m)}_{j})}$ (2)

or:

$\displaystyle q(y_{i}|x_{i})_{\textit{ens}}=\max_{1\leq m\leq M}\frac{\exp(z^{(m)}_{y_{i}})}{\sum_{j=1}^{K}\exp(z^{(m)}_{j})}$ (3)

where $z^{(m)}_{y_{i}}$ is the logit produced by the $m$-th model for the candidate class $y_{i}$ among the $K$ classes.
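The two combination rules of Eqs (2) and (3) can be sketched as follows. The logits are hypothetical values for three models and one image; the function names are ours and not part of any library.

```python
import math

def softmax(logits):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_average(all_logits):
    """Eq. (2): mean of the M softmax outputs, class by class."""
    probs = [softmax(z) for z in all_logits]
    return [sum(col) / len(probs) for col in zip(*probs)]

def ensemble_maximum(all_logits):
    """Eq. (3): largest softmax output over the M models, class by class."""
    probs = [softmax(z) for z in all_logits]
    return [max(col) for col in zip(*probs)]

def predict(scores):
    """Index of the class with the highest ensemble score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical logits from M = 3 models for one image and K = 7 classes.
logits = [
    [0.1, 0.0, 0.2, 2.0, 0.3, 0.1, 1.5],
    [0.0, 0.1, 0.1, 1.2, 0.2, 0.0, 1.9],
    [0.2, 0.0, 0.0, 2.4, 0.1, 0.3, 1.0],
]
avg = ensemble_average(logits)
mx = ensemble_maximum(logits)
print(predict(avg), predict(mx))
```

Here the second model alone would choose class 6 (neutral), while both ensemble rules select class 3 (happy). Note that the averaged scores still sum to one, whereas the maximum rule yields scores that generally sum to more than one and are only meaningful for ranking the classes.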

Table 5

Emotion class distribution regarding the facial expression databases, FER 2013, SFEW 2.0 and ExpW

Database	Subset	Angry	Disgust	Afraid	Happy	Sad	Surprised	Neutral	Total
FER 2013	Training	3995	436	4097	7215	4830	3171	4965	28709
	Validation	491	55	528	879	594	416	626	3589
	Test	467	56	496	895	653	415	607	3589
SFEW 2.0	Training	178	52	78	184	161	94	144	891
	Validation	77	23	46	72	73	56	84	431
ExpW	All images	3671	3995	1088	30537	10559	7060	34883	91793

4. Experiments

To verify the validity of our method, we carried out several experiments in three phases. First, we trained our single models (A, B, C, and D) on the FER 2013 database without the label smoothing technique; afterwards, all possible ensembles were built to investigate the generalization capability of ensembling against single models. In the second phase, we trained the single models using label smoothing with $\epsilon=$ 0.2 and compared their performance with that of the first models trained without label smoothing, and similarly for the ensembles. Finally, following a transfer learning approach, the weights of the single models trained on FER 2013 with/without label smoothing were fine-tuned on the SFEW 2.0 and ExpW databases separately.
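"All possible ensembles" of the four single models means every subset of at least two models, which can be enumerated as in this short sketch (variable names are ours):

```python
from itertools import combinations

models = ["A", "B", "C", "D"]

# Every subset of two or more models, in order of increasing size.
ensembles = [combo
             for size in range(2, len(models) + 1)
             for combo in combinations(models, size)]

for combo in ensembles:
    print(" ".join(combo))
print(len(ensembles))
```

This gives six pairs, four triplets, and the full set, that is, the 11 ensembles evaluated in the result tables.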

To keep our recognition performance on the FER 2013 and SFEW 2.0 databases comparable with that reported in the literature, we trained the four single models with/without label smoothing using the same training subset, according to each original holdout split. For the ExpW database, since no particular split is provided, the FER 2013 ratio was adopted: 80% for training, 10% for validation, and 10% for test.

Special attention has been paid to the FER processing time in order to verify the integration capability in real devices under real-time constraints. Three different hardware configurations have been tested: a multi-core i7 CPU without a dedicated GPU, a multi-core i7 CPU with a GeForce GTX 1080 GPU, and the Jetson Nano with a Tegra X1 GPU for embedded systems. The processing time was computed on each hardware configuration and for each frame, for the face detection and facial expression recognition stages independently. For the face detection step, the Viola-Jones and YOLO v2 methods were tested. The experimental protocol was performed using an attached webcam with a resolution of 640 $\times$ 480 px.
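A per-stage timing measurement of this kind can be sketched as below. The two stage functions are stand-ins invented for the example (real code would call the face detector and the CNN ensemble), and the number of repeats is an arbitrary choice.

```python
import time

def time_stages(frame, stages, repeats=20):
    """Return the mean latency in milliseconds of each stage, measured separately."""
    timings = {}
    data = frame
    for name, fn in stages:
        start = time.perf_counter()
        for _ in range(repeats):
            out = fn(data)
        timings[name] = (time.perf_counter() - start) * 1000.0 / repeats
        data = out                       # the next stage consumes this output
    return timings

# Stand-in stages (hypothetical): replace with the detector and the ensemble.
def detect_face(frame):
    return frame[:48]

def recognise_expression(face):
    return sum(face) % 7

frame = list(range(640))
timings = time_stages(frame, [("detection", detect_face),
                              ("recognition", recognise_expression)])
fps = 1000.0 / max(sum(timings.values()), 1e-9)
print(timings, fps)
```

Timing each stage on its own, as in the protocol above, makes it possible to compare face detectors without re-measuring the recognition stage, and the sum of the two latencies gives the achievable frame rate.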

All architectures have been implemented using the Keras framework and trained using the categorical cross-entropy loss and the Adam optimizer, with a batch size of 128 for 100 epochs each. We used the learning rate reducer and early stopping callbacks to keep the best model weights according to the validation loss.
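The interplay of the two callbacks can be illustrated with a simplified, framework-free replay of a validation-loss curve. This is not the Keras implementation: the reduction factor, the two patience values, and the loss curve are hypothetical, since the source does not report them.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.5, lr_patience=2, stop_patience=5):
    """Replay a validation-loss curve through plateau-based learning rate
    reduction and early stopping (simplified logic, hypothetical settings).

    Returns (best_epoch, final_lr, last_epoch_run)."""
    best, best_epoch = float("inf"), -1
    since_best = since_lr_change = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # best weights would be saved here
            since_best = since_lr_change = 0
            continue
        since_best += 1
        since_lr_change += 1
        if since_lr_change >= lr_patience:
            lr *= factor                     # plateau: reduce the learning rate
            since_lr_change = 0
        if since_best >= stop_patience:
            return best_epoch, lr, epoch     # no improvement: stop training
    return best_epoch, lr, len(val_losses) - 1

# Hypothetical validation losses, one per epoch.
curve = [1.20, 1.05, 0.98, 0.99, 1.00, 0.97, 0.98, 0.99, 1.01, 1.00, 1.02, 1.03]
best_epoch, final_lr, last_epoch = run_schedule(curve)
print(best_epoch, final_lr, last_epoch)
```

On this curve the best weights are those of epoch 5, the learning rate is halved three times, and training stops at epoch 10, well before the nominal epoch budget.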

4.1 Databases

To validate the method under conditions close to real-environment scenarios, experiments have been carried out on three benchmark databases collected in the wild: FER 2013 [24], SFEW 2.0 [25], and ExpW [26].

4.1.1 FER 2013 database

FER 2013 is a large-scale database introduced in the ICML 2013 Challenges in Representation Learning. It consists of three subsets of 48 $\times$ 48 px face images collected in the wild: 28709 images for training, 3589 images for validation, and 3589 images for test. All images are labelled as follows: 0 angry, 1 disgust, 2 afraid, 3 happy, 4 sad, 5 surprised, and 6 neutral.

4.1.2 SFEW 2.0 database

The Static Facial Expression in the Wild database (SFEW) is collected by selecting static frames from movies available in the Acted Facial Expression in the Wild database (AFEW) [47], including unconstrained facial expressions under challenging scenarios such as lighting and occlusions, for a total of 95 subjects. It consists of three subsets: 891 frames for training, 431 frames for validation, and 372 frames for test, labeled according to the six prototypical expressions: angry, disgust, afraid, happy, sad, and surprised, plus the neutral expression. The test set labelling is not publicly available.

4.1.3 ExpW database

The Expression in-the-Wild database (ExpW) consists of 91793 unconstrained face images, collected using the Google image search API and manually labelled with the seven basic expressions: angry, disgust, afraid, happy, sad, surprised, and neutral. The database is fully face-annotated, a confidence score from 0 to 100 is provided for each bounding box, and non-face images were removed. Since the creators did not provide a particular validation split, we split the ExpW database following the FER 2013 ratio, 80% for training, 10% for validation, and 10% for test, giving 73434, 9180 and 9179 images for each subset, respectively.
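The subset sizes follow from the 80/10/10 ratio once a rounding convention is fixed. The convention below (floor for training, the odd remaining image assigned to validation) is our assumption; it happens to reproduce the reported figures.

```python
def split_sizes(n, train_tenths=8):
    """Sizes of an 80/10/10 split of n images (rounding rule is an assumption)."""
    train = n * train_tenths // 10      # floor of 80 %
    rest = n - train
    test = rest // 2
    val = rest - test                   # validation takes the odd image, if any
    return train, val, test

sizes = split_sizes(91793)
print(sizes)
```

For the 91793 ExpW images this yields 73434, 9180 and 9179 images for training, validation and test.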

The data distribution of each database is given in detail in Table 5.

Table 6
Obtained recognition performances for CNN single models with/without label smoothing on the FER 2013 validation and test sets

Model	Without label smoothing		With label smoothing
	Validation accuracy	Test accuracy	Validation accuracy	Test accuracy
A	66.40%	68.68%	67.15%	68.74%
B	65.76%	67.48%	66.95%	69.30%
C	65.51%	68.15%	68.60%	68.99%
D	67.26%	68.15%	67.85%	69.24%

Table 7

Obtained recognition performances using CNN ensemble models with/without label smoothing on the FER 2013 validation and test sets

Ensembles	Without label smoothing				With label smoothing
	Validation accuracy		Test Accuracy		Validation accuracy		Test accuracy
	Average	Maximum	Average	Maximum	Average	Maximum	Average	Maximum
A B	68.96%	68.65%	70.47%	70.38%	69.43%	69.07%	71.75%	71.11%
A C	68.93%	68.15%	70.91%	70.97%	70.27%	69.60%	71.19%	70.83%
A D	69.77%	69.32%	71.11%	70.55%	69.63%	69.52%	71.39%	70.94%
B C	68.43%	68.04%	70.58%	70.05%	69.57%	69.60%	71.52%	70.86%
B D	67.46%	67.32%	69.60%	69.43%	69.16%	68.85%	70.72%	70.63%
C D	68.68%	68.04%	70.08%	69.80%	69.60%	69.24%	70.47%	70.05%
A B C	69.30%	69.32%	72.47%	71.75%	70.88%	70.47%	72.72%	71.47%
A B D	69.41%	69.30%	71.08%	70.80%	69.74%	69.41%	71.94%	71.16%
B C D	68.88%	68.15%	71.50%	70.69%	69.91%	69.49%	71.47%	70.88%
C A D	69.69%	69.52%	71.66%	71.08%	70.19%	69.69%	71.77%	71.33%
A B C D	70.00%	69.43%	72.14%	71.55%	70.66%	69.94%	72.22%	71.19%

Table 8

Obtained recognition performances for CNN single models with/without label smoothing on the SFEW 2.0 validation set

Model	Without label smoothing	With label smoothing
A	46.87%	47.80%
B	47.33%	47.80%
C	48.49%	49.19%
D	46.17%	47.10%

4.2 Results and discussions

The obtained results can be analyzed according to three aspects: the ensemble performance, the label smoothing optimization, and the computation time. Tables 6, 8 and 10 show the single model performances with/without the label smoothing technique on the FER 2013, SFEW 2.0, and ExpW databases, respectively. These models have been used to build the ensembles, whose performances are provided in Tables 7, 9 and 11. The best recognition accuracy for each benchmark database is then compared with the performances reported in the literature and highlighted in Table 13.

Table 9
Obtained recognition performances using CNN ensemble models with/without label smoothing on the SFEW 2.0 validation set

Ensembles	Without label smoothing		With label smoothing
	Average	Maximum	Average	Maximum
A B	45.94%	46.40%	48.26%	48.03%
A C	48.49%	48.96%	50.12%	48.96%
A D	48.72%	48.26%	50.35%	49.65%
B C	48.03%	47.80%	47.33%	46.40%
B D	49.19%	48.72%	48.72%	48.49%
C D	48.96%	48.49%	48.03%	48.03%
A B C	48.03%	47.10%	49.42%	47.56%
A B D	50.35%	48.96%	50.81%	48.96%
B C D	49.65%	49.65%	48.72%	48.96%
C A D	51.04%	48.96%	51.97%	49.19%
A B C D	49.88%	48.96%	51.04%	49.19%

Table 10

Obtained recognition performances for CNN single models with/without label smoothing on the ExpW validation and test sets

Model	Without label smoothing		With label smoothing
	Validation accuracy	Test accuracy	Validation accuracy	Test accuracy
A	69.52%	68.48%	70.27%	69.81%
B	70.65%	69.58%	70.65%	69.63%
C	70.58%	69.47%	70.95%	70.18%
D	70.84%	69.28%	70.47%	69.88%

Table 11

Obtained recognition performances using CNN ensemble models with/without label smoothing on the ExpW validation and test sets

Ensembles	Without label smoothing				With label smoothing
	Validation accuracy		Test accuracy		Validation accuracy		Test accuracy
	Average	Maximum	Average	Maximum	Average	Maximum	Average	Maximum
A B	70.87%	70.95%	69.67%	69.70%	71.31%	71.02%	70.81%	70.67%
A C	71.42%	71.26%	70.35%	70.08%	71.84%	71.60%	71.30%	71.15%
A D	71.37%	71.51%	69.72%	69.59%	71.59%	71.37%	71.04%	70.78%
B C	71.71%	71.46%	70.38%	70.12%	71.63%	71.56%	71.08%	70.76%
B D	71.67%	71.60%	70.42%	70.24%	71.92%	71.56%	70.91%	70.34%
C D	71.75%	71.84%	70.39%	70.35%	71.93%	71.59%	70.96%	70.88%
A B C	71.78%	71.51%	70.51%	70.25%	72.21%	71.98%	71.49%	71.25%
A B D	72.03%	71.69%	70.26%	70.12%	72.05%	71.53%	71.27%	71.04%
B C D	71.99%	71.82%	70.69%	70.54%	72.47%	72.02%	71.35%	71.06%
C A D	71.95%	71.85%	70.35%	70.42%	72.10%	71.82%	71.60%	71.18%
A B C D	72.06%	71.90%	70.68%	70.50%	72.54%	71.95%	71.82%	71.35%

Table 12

Comparative study of the proposed FER computation time, regarding three hardware configurations using the ensemble ABC

Face detection method	Hardware	Pre-processing time (ms)	Ensemble recognition time (ms)	Total (ms)
Viola-Jones	Intel i7 CPU	40.84 $\pm$ 4.55	101.13 $\pm$ 6.63	141.97 $\pm$ 3.34
YOLO v2	Intel i7 CPU	201.81 $\pm$ 7.73	105.9 $\pm$ 5.73	307.71 $\pm$ 3.67
Viola-Jones	GeForce GTX 1080	12.45 $\pm$ 0.5	4.78 $\pm$ 0.22	17.23 $\pm$ 0.84
YOLO v2	GeForce GTX 1080	8.59 $\pm$ 1.1	4.89 $\pm$ 1.12	13.48 $\pm$ 1.48
Viola-Jones	Tegra X1	68.05 $\pm$ 9.60	69.17 $\pm$ 1.74	137.22 $\pm$ 3.37
YOLO v2	Tegra X1	134.39 $\pm$ 2.01	68.65 $\pm$ 1.39	203.04 $\pm$ 1.84

Table 13

Comparative study of reported performances in the literature regarding FER 2013, SFEW 2.0 and ExpW databases

Database	Methods	Reported performance
FER 2013 (test set)	Mollahosseini et al. (2016) [27]	66.40%
	Devries et al. (2014) [29]	67.21%
	Tang (2016) [30]	71.20%
	Guo et al. (2016) [28]	71.33%
	Benamara et al. (Ours – 2019) [23]	72.47%
	Proposed Method	72.72%
	Kim et al. (2016) [33]	73.73%
	Pramerdorfer et al. (2016) [34]	75.20%
SFEW 2.0 (validation set)	Ng et al. (2015) [48]	48.50%
	Meng et al. (2017) [49]	50.98%
	Li et al. (2017) [50]	51.05%
	Levi et al. (2015) [32]	51.75%
	Proposed Method	51.97%
	Cai et al. (2018) [51]	52.52%
	Ding et al. (2016) [52]	55.15%
ExpW	Proposed Method	71.82%
	Lian et al. (2020) [31]	71.90%

Figure 3.

Obtained confusion matrices for the proposed single CNN models and for the two best ensembles regarding FER 2013 test set accuracy (Top: average fusion strategy, Bottom: maximum fusion strategy).

4.2.1 Ensemble performance analysis

With respect to the confusion matrices obtained for every single model trained on the FER 2013 database, illustrated in Fig. 3, we can see that individual models generalize better for particular emotions: Model A for ‘sad’ and ‘surprised’; Model B for ‘disgust’, ‘happy’ and ‘neutral’; Model C for ‘afraid’; and Model D for ‘angry’ and ‘neutral’. The ABC and ABCD ensembles take advantage of the generalization capabilities of each single model, in contrast to using the models independently, achieving our best result without label smoothing on FER 2013 with 72.47% using the ensemble ABC with the average score fusion strategy. This is validated on SFEW 2.0, with the best accuracy of 51.04% using the ensemble CAD, and on the ExpW database, with 70.69% using the ensemble BCD. The results thus obtained show that the ensembles outperform the single models in terms of recognition accuracy.

It should be mentioned that the average score fusion strategy gives better performance than the maximum score fusion strategy in most cases. This shows that the combined contribution of all single models to estimating the correct emotion outperforms the high confidence of only one model, since facial emotion recognition remains a challenging task even for humans due to similarities between emotional states. Furthermore, we can see that ensemble performance does not simply grow with the number of models, since in these experiments the best accuracies on the FER 2013, SFEW 2.0, and ExpW databases have been obtained using only 3 CNNs instead of 4.
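To make the difference between the two fusion strategies concrete, the following minimal Python sketch applies both to hypothetical softmax outputs of three models. It assumes that average fusion takes the class-wise mean of the model scores and that maximum fusion takes the class-wise maximum before selecting the top class; the class order, the helper names (`average_fusion`, `maximum_fusion`, `predict`), and the score values are illustrative assumptions rather than the actual implementation.

```python
from typing import List, Sequence

# Assumed class order, for illustration only.
EMOTIONS = ["angry", "disgust", "afraid", "happy", "sad", "surprised", "neutral"]


def average_fusion(scores: Sequence[Sequence[float]]) -> List[float]:
    """Class-wise mean of the scores produced by all models."""
    n_models = len(scores)
    return [sum(column) / n_models for column in zip(*scores)]


def maximum_fusion(scores: Sequence[Sequence[float]]) -> List[float]:
    """Class-wise maximum of the scores produced by all models."""
    return [max(column) for column in zip(*scores)]


def predict(fused: Sequence[float]) -> str:
    """Label of the class with the highest fused score."""
    best_index = max(range(len(fused)), key=lambda i: fused[i])
    return EMOTIONS[best_index]


# Hypothetical softmax outputs of three models for the same face image.
scores = [
    [0.05, 0.02, 0.25, 0.03, 0.05, 0.55, 0.05],  # moderately favours "surprised"
    [0.03, 0.02, 0.30, 0.02, 0.03, 0.55, 0.05],  # moderately favours "surprised"
    [0.02, 0.01, 0.60, 0.02, 0.03, 0.30, 0.02],  # favours "afraid" more confidently
]

print("average fusion:", predict(average_fusion(scores)))
print("maximum fusion:", predict(maximum_fusion(scores)))
```

In this constructed case, two models moderately favour ‘surprised’ while one model favours ‘afraid’ with higher confidence, so average fusion follows the consensus and maximum fusion follows the single most confident model.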

4.2.2 Label smoothing analysis

The label smoothing technique improves the recognition accuracy compared to models trained without label smoothing, for our four single models and across the experiments carried out on the three databases, with the highest improvement of 0.84% for model C on FER 2013, 0.93% for models A and D on SFEW 2.0, and 1.33% for model A on ExpW, as shown in Tables 6, 8, and 10 respectively. This has also contributed to improving the performance of the ensembles built from the single models trained with label smoothing, achieving our best results with 72.72% using the ensemble ABC on FER 2013, 51.97% using the ensemble CAD on SFEW 2.0, and 71.82% using the ensemble ABCD on the ExpW database.
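As an illustration of how label smoothing reduces the confidence placed in a training label, the sketch below mixes a one-hot target with the uniform distribution over the $K$ classes, following the common formulation $y^{LS}_k = (1-\epsilon)\,y_k + \epsilon/K$ [35]. The smoothing factor `EPSILON = 0.1`, the function names, and the example prediction vector are assumptions chosen for illustration only.

```python
import math
from typing import List


def smooth_labels(one_hot: List[float], epsilon: float) -> List[float]:
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


EPSILON = 0.1  # assumed smoothing factor, not a value reported in this paper

# One-hot target for a 7-class problem; index 3 is the annotated class.
one_hot = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
soft = smooth_labels(one_hot, EPSILON)

# A hypothetical, very confident prediction for the annotated class.
overconfident = [0.001, 0.001, 0.001, 0.994, 0.001, 0.001, 0.001]

print("smoothed target:", [round(v, 4) for v in soft])
print("loss, hard target:", cross_entropy(one_hot, overconfident))
print("loss, soft target:", cross_entropy(soft, overconfident))
```

With the hard target, the loss keeps decreasing as the prediction approaches certainty; with the smoothed target, an overconfident prediction is penalized, so a possibly wrong label pulls the model less strongly.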

For a closer look, we first applied a dimensionality reduction technique, principal component analysis (PCA), to the features extracted from the inner layers of models B and C, trained with/without label smoothing, on the FER 2013 validation set, and then projected them to 2D space using t-SNE [53], as shown in Fig. 4. We can see that several samples have been mislabelled, particularly happy emotion as neutral and afraid as surprise. Label smoothing thus contributes to clustering the data distribution, which makes it easier for a multi-class classifier to separate the emotion classes, and this explains the improved performance compared to models trained without label smoothing.

Figure 4.

t-SNE visualization of extracted features from FER 2013 validation set using a trained network with and without label smoothing: (a) using trained Model B without label smoothing, (b) using trained Model C without label smoothing, (c) using trained Model B with label smoothing, (d) using trained Model C with label smoothing (mislabelled samples are marked with red squares).

Figure 5.

Face detection carried out with YOLO v2 model and Viola-Jones method on ExpW database samples: (Red) Detected bounding boxes using YOLO v2, (Blue) Detected bounding boxes using Viola-Jones method, (Green) Ground truth bounding boxes provided by ExpW database creators [26].

4.2.3 Time computation analysis

The system is capable of recognizing facial expressions under real-time constraints, as shown in Table 12. It takes an average time of only 13.48 ms when running on a hardware configuration with an Intel i7 CPU and an Nvidia GTX 1080 GPU, using the YOLO v2 model for the face detection stage and an ensemble of 3 CNNs (models A, B, and C) for facial emotion recognition. This is a competitive solution that can easily be employed on videos at 60 FPS. In contrast, on a CPU-only hardware configuration, it takes an average time of 307.71 ms when the YOLO v2 model is used for face detection and 141.97 ms when it is replaced by the Viola-Jones method, which requires far less computation than the deep learning-based model; the latter configuration is fast enough for video at 6 FPS. The choice depends on the environment in which facial emotion recognition is employed, since the YOLO v2 model detects unconstrained faces more robustly than the Viola-Jones method. It should be mentioned that the computation time could be reduced by lowering the video resolution from 640 $\times$ 480 px to 320 $\times$ 240 px.
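The frame rates quoted above follow directly from the total times in Table 12. The short sketch below converts a per-frame latency into the highest sustainable frame rate, under the simplifying assumption that frames are processed one after another with no pipelining or frame skipping; the helper names `max_fps` and `keeps_up` are hypothetical.

```python
# Total per-frame times in milliseconds, taken from Table 12.
TOTAL_LATENCY_MS = {
    ("YOLO v2", "GTX 1080"): 13.48,
    ("Viola-Jones", "i7 CPU"): 141.97,
    ("YOLO v2", "i7 CPU"): 307.71,
}


def max_fps(latency_ms: float) -> float:
    """Highest frame rate sustainable when each frame takes latency_ms."""
    return 1000.0 / latency_ms


def keeps_up(latency_ms: float, video_fps: float) -> bool:
    """True if one frame is processed within the frame period of the video."""
    return latency_ms <= 1000.0 / video_fps


for (detector, hardware), latency in TOTAL_LATENCY_MS.items():
    print(f"{detector} on {hardware}: {max_fps(latency):.2f} FPS")

print(keeps_up(13.48, 60))   # GPU configuration against 60 FPS video
print(keeps_up(141.97, 6))   # CPU with Viola-Jones against 6 FPS video
print(keeps_up(307.71, 6))   # CPU with YOLO v2 against 6 FPS video
```

Under this assumption, the three configurations sustain roughly 74, 7, and 3 frames per second respectively, which is consistent with the 60 FPS and 6 FPS figures given in the text.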

On the other hand, artificial intelligence embedded in robotics is evolving towards the use of GPU-based hardware architectures [54, 55, 56] to take advantage of deep learning models. On the Jetson Nano with Tegra X1 GPU, the system takes 203.04 ms when using the YOLO v2 model and 137.22 ms when using the Viola-Jones method for face detection. It can easily be adapted to any robot as an embedded sub-system for task-oriented designs.

4.2.4 Discussion

The developed FER system offers a robust solution for emotional state estimation based on facial expressions, for several end-user hardware configurations. For applications that require unconstrained face detection, the YOLO v2 model is recommended over the Viola-Jones method, since it is robust in real environments, as illustrated in Fig. 5, where the Viola-Jones method often fails to detect faces that are occluded or not frontally oriented; this robustness comes at the price of a high amount of computation. Our system meets the real-time constraints when a powerful GPU such as the GTX 1080 is used, since it runs smoothly and takes only 13.48 ms for face detection and emotion recognition. To adapt to any robot, the Jetson Nano with Tegra X1 GPU could be an interesting solution as an embedded sub-system, since our FER system will work easily on videos at 5 FPS.

The obtained recognition results are in good agreement with other studies [32, 33, 34], which have shown that ensemble learning often gives higher performance than single models. Furthermore, based on their reported results, our system reaches the current standards in facial emotion recognition, with comparable performance on three challenging databases collected in the wild, FER 2013, SFEW 2.0, and ExpW, without using any data augmentation or face alignment technique. Our best ensemble on the FER 2013 database, ABC, uses only 3 CNNs for a total of 5.25 million parameters, against [32], which used a set of 20 deep neural networks, [33] with 9 DCNs, and [34] with 8 CNNs. On the ExpW database, our performance is 0.08% lower than that of [31]; nevertheless, in their experiments, they selected only the images with a high box confidence score of 60 to 100 from this database, whereas in our work all images have been used, even those with lower box confidence scores. Our system is quite straightforward and can be adapted easily for several computer vision and robotics applications.

In this paper, we extended our previous work [23] by adding the label smoothing technique to our models, reducing the confidence given to the training data labels, based on the assumption that several mislabelled samples can drive the model to learn wrong features. Label smoothing has demonstrated its efficiency in clustering data distributions, as shown by the t-SNE visualization in Fig. 4: mislabelled data with a high degree of emotion similarity can join the correct cluster, further improving the recognition performance with an efficient classification stage. It should be mentioned that a few images may be clearly wrongly labelled at the source, like those marked with red squares in the t-SNE visualization in Fig. 4, and label smoothing will fail to deal with these samples.

5. Conclusions

In this paper, we have proposed a framework for facial emotion recognition. It adopts neural network ensemble learning with only three to four CNN architectures, for better model generalization. Divide and conquer has proved to be a good strategy, since the first stage deals only with face detection and pre-processing in different conditions, which allows the second stage to focus only on inferring the emotional state, thereby simplifying the whole process.

The label smoothing technique has clearly shown that reducing the confidence given to training labels improves the FER accuracy and, when combined with ensemble learning, reaches the current state-of-the-art performances on three challenging databases, FER 2013, SFEW 2.0, and ExpW.

The proposed FER system provides an advanced solution for several computer vision applications and could improve human-robot interaction (HRI), mainly in affective robotics, by making the robot accurately aware of the users’ emotional state. This could be used for therapeutic purposes, such as with autistic and Alzheimer’s patients, and as a companion for elderly people.

For future work, we plan to combine the findings on static images with the temporal information provided by videos to study in depth the variations between facial expression states, and further with the emotional state conveyed by speech intensity. This could improve emotion recognition compared with using images only.

Acknowledgments

We want to acknowledge the Programa de Ayudas a Grupos de Excelencia de la Región de Murcia, from Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia.

References

1. Nelson, Saunders, Playter. The PETMAN and Atlas Robots at Boston Dynamics. Humanoid Robotics: A Reference. 2019; 169–186.

2. Cunningham, Galceran, Mehta, Ferrer, Eustice, Olson. MPDM: Multi-policy Decision-Making from Autonomous Driving to Social Robot Navigation. In: Control Strategies for Advanced Driver Assistance Systems and Autonomous Driving Functions. Springer; 2019. pp. 201–223.

3. Chiu, Sainath, Prabhavalkar, Nguyen, Chen, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2018. pp. 4774–4778.

4. Martin-Rico, Gomez-Donoso, Escalona, Garcia-Rodriguez, Cazorla. Semantic visual recognition in a cognitive architecture for social robots. Integrated Computer-Aided Engineering. 2020 May; 27(3): 301–316.

5. Almagro, Fresno, de la Paz. Speech gestural interpretation by applying word representations in robotics. Integrated Computer-Aided Engineering. 2018 Dec; 26(1): 97–109.

6. Šabanović, Bennett, Chang, Huber. PARO robot affects diverse interaction modalities in group sensory therapy for older adults with dementia. In: 2013 IEEE 13th International Conference on Rehabilitation Robotics (ICORR). IEEE; 2013. pp. 1–6.

7. Wood, Zaraki, Robins, Dautenhahn. Developing Kaspar: A Humanoid Robot for Children with Autism. International Journal of Social Robotics. 2019 Jul; Available from: https://doi.org/10.1007/s12369-019-00563-6.

8. Jeong, Zisook, Plummer, Breazeal, Weinstock, Logan, et al. A Social Robot to Mitigate Stress, Anxiety, and Pain in Hospital Pediatric Care. In: Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts – HRI’15 Extended Abstracts. Portland, Oregon, USA: ACM Press; 2015. pp. 103–104.

9. Tang, Deng, Goerner, Zhang. Object affordance based multimodal fusion for natural Human-Robot interaction. Cognitive Systems Research. 2019; 54: 128–137.

10. Jack, Garrod, Caldara, Schyns. Facial expressions of emotion are not culturally universal. Proceedings of the National Academy of Sciences. 2012; 109(19): 7241–7244.

11. Ekman. Pictures of facial affect. Consulting Psychologists Press. 1976.

12. Lin, Pan. Integrating a mixed-feature model and multiclass support vector machine for facial expression recognition. Integrated Computer-Aided Engineering. 2009 Jan; 16(1): 61–74.

13. Taleb, Mammar, Ouamri. New face expression recognition using polar angular radial transform and principal component analysis. International Journal of Biometrics. 2018; 10(2): 176.

14. Deng. Deep Facial Expression Recognition: A Survey. IEEE Transactions on Affective Computing. 2020; 1–1.

15. Noroozi, Corneanu, Kamińska, Sapiński, Escalera, Anbarjafari. Survey on Emotional Body Gesture Recognition. arXiv:1801.07481 [cs]. 2018 Jan.

16. Akçay, Oğuz. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication. 2020 Jan; 116: 56–76.

17. Emotion Recognition From EEG Using Higher Order Crossings; 14.

18. Zheng, Cui, Zong. EEG emotion recognition based on graph regularized sparse linear regression. Neural Processing Letters. 2019 Apr; 49(2): 555–571.

19. Xie, Shan, Chen, Meng, Gao. Learned local gabor patterns for face representation and recognition. Signal Processing. 2009 Dec; 89(12): 2333–2344.

20. Ojala, Pietikäinen, Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition. 1996 Jan; 29(1): 51–59.

21. Wang, Bai. Regional parallel structure based CNN for thermal infrared face identification. Integrated Computer-Aided Engineering. 2018 May; 25(3): 247–260.

22. Vera-Olmos, Pardo, Melero, Malpica. DeepEye: Deep convolutional network for pupil detection in real environments. Integrated Computer-Aided Engineering. 2018 Dec; 26(1): 85–95.

23. Benamara, Val-Calvo, Álvarez Sánchez, Díaz-Morcillo, Ferrández Vicente, Fernández-Jover, et al. Real-Time Emotional Recognition for Sociable Robotics Based on Deep Neural Networks Ensemble. In: Ferrández Vicente, Álvarez Sánchez, de la Paz López, Toledo Moreo, Adeli, eds. Understanding the Brain Function and Emotions. vol. 11486. Cham: Springer International Publishing; 2019. pp. 171–180.

24. Goodfellow, Erhan, Carrier, Courville, Mirza, Hamner, et al. Challenges in Representation Learning: A report on three machine learning contests. arXiv:1307.0414 [cs, stat]. 2013 Jul.

25. Dhall, Goecke, Lucey, Gedeon. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). Barcelona, Spain: IEEE; 2011. pp. 2106–2112.

26. Zhanpeng Zhang CCL Ping Luo Tang. From Facial Expression Recognition to Interpersonal Relation Prediction. In: arXiv:1609.06426v2; 2016.

27. Mollahosseini, Chan, Mahoor. Going deeper in facial expression recognition using deep neural networks. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV); 2016. pp. 1–10.

28. Guo, Tao, Xiong, Tao. Deep Neural Networks with Relativity Learning for facial expression recognition. In: 2016 IEEE International Conference on Multimedia Expo Workshops (ICMEW); 2016. pp. 1–6.

29. Devries, Biswaranjan, Taylor. Multi-task Learning of Facial Landmarks and Expression. In: 2014 Canadian Conference on Computer and Robot Vision; 2014. pp. 98–103.

30. Tang. Deep Learning using Linear Support Vector Machines. arXiv:1306.0239 [cs, stat]. 2013 Jun.

31. Lian, Tao, Huang, Niu. Expression analysis based on face regions in real-world conditions. International Journal of Automation and Computing. 2020 Feb; 17(1): 96–107.

32. Levi, Hassner. Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction – ICMI ’15. Seattle, Washington, USA: ACM Press; 2015. pp. 503–510.

33. Kim, Dong, Roh, Kim, Lee. Fusing Aligned and Non-aligned Face Information for Automatic Affect Recognition in the Wild: A Deep Learning Approach. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Las Vegas, NV, USA: IEEE; 2016. pp. 1499–1508.

34. Pramerdorfer, Kampel. Facial Expression Recognition using Convolutional Neural Networks: State of the Art. arXiv:1612.02903 [cs]. 2016 Dec.

35. Szegedy, Vanhoucke, Ioffe, Shlens, Wojna. Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [cs]. 2015 Dec. Available from: http://arxiv.org/abs/1512.00567.

36. Viola, Jones. Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. Vol. 1. IEEE; 2001. pp. 1–1.

37. Girshick, Donahue, Darrell, Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [cs]. 2014 Oct.

38. Girshick. Fast R-CNN. arXiv:1504.08083 [cs]. 2015 Sep.

39. Ren, Girshick, Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs]. 2016 Jan.

40. Liu, Anguelov, Erhan, Szegedy, Reed, et al. SSD: Single Shot MultiBox Detector. arXiv:1512.02325 [cs]. 2016; 9905: 21–37.

41. Redmon, Divvala, Girshick, Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE; 2016. pp. 779–788.

42. Liu, Ouyang, Wang, Fieguth, Chen, Liu, et al. Deep learning for generic object detection: A survey. International Journal of Computer Vision. 2020; 128(2): 261–318.

43. Redmon, Farhadi. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 7263–7271.

44. Yang, Luo, Loy, Tang. WIDER FACE: A Face Detection Benchmark. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE; 2016. pp. 5525–5533.

45. Springenberg, Dosovitskiy, Brox, Riedmiller. Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [cs]. 2014 Dec.

46. Lin, Chen, Yan. Network In Network. arXiv:1312.4400 [cs]. 2013 Dec.

47. Dhall, Goecke, Lucey, Gedeon. Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia. 2012 Jul; 19(3): 34–41.

48. Nguyen, Vonikakis, Winkler. Deep Learning for Emotion Recognition on Small Datasets using Transfer Learning. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction – ICMI ’15. Seattle, Washington, USA: ACM Press; 2015. pp. 443–449.

49. Meng, Liu, Cai, Han, Tong. Identity-Aware Convolutional Neural Network for Facial Expression Recognition. In: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). Washington, DC, USA: IEEE; 2017. pp. 558–565.

50. Deng. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE; 2017. pp. 2584–2593.

51. Cai, Meng, Khan, OReilly, Tong. Island Loss for Learning Discriminative Features in Facial Expression Recognition. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). Xi’an: IEEE; 2018. pp. 302–309.

52. Ding, Zhou, Chellappa. FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition. 2016 Sep.

53. Maaten Lvd, Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008; 9(Nov): 2579–2605.

54. Manderson, Dudek. GPU-Assisted Learning on an Autonomous Marine Robot for Vision-Based Navigation and Image Understanding. In: OCEANS 2018 MTS/IEEE Charleston. IEEE; 2018. pp. 1–6.

55. Mittal. A survey on optimized implementation of deep learning models on the NVIDIA Jetson platform. Journal of Systems Architecture. 2019.

56. Buettner, Baumgartl. A highly effective deep learning based escape route recognition module for autonomous robots in crisis and emergency situations. In: Proceedings of the 52nd Hawaii International Conference on System Sciences; 2019.
