Facial expression recognition under constrained conditions using stacked generalized convolution neural network

Abstract

A cognitive analysis of facial features can make a facial expression recognition system more robust and efficient for Human-Machine Interaction (HMI) applications. In this work, we propose a new methodology to improve the accuracy of facial expression recognition even under constraints such as partially hidden faces or occlusions in real-time applications. As a first step, seven independent facial segments (full face, half face (left/right), upper half face, lower half face, eyes, mouth and nose) are considered for recognizing facial expressions. Unlike work reported in the literature, where arbitrarily generated patch-type occlusions on facial regions are used, this work analyzes each facial feature in detail. Using the results thus obtained, the seven sub models are combined using a stacked generalized ensemble method, with a deep neural network as meta-learner, to improve the accuracy of facial expression recognition even in the occluded state. The accuracy of the proposed model improved by up to 30% compared to the individual model accuracies on the cross-corpus seven-model datasets. The proposed system uses CNN with RPA compliance and is also configured on a Raspberry Pi, so it can be used for HRI and Industry 4.0 applications that involve face occlusion and partially hidden faces.

Keywords

Facial features analysis, facial expression recognition, convolution neural networks, occlusion, human-machine interaction, stacked generalization

1. Introduction

Facial expression recognition plays a significant role in the growth of cognitive systems for Human-Machine Interaction (HMI). Rapid progress in robotic technology, hardware efficiency and artificial intelligence has led to a surge in the development of collaborative and social robots. Progress in HMI has inspired a few implementations in which a robot is programmed to follow human emotions. In many applications, such as E-Learning feedback mechanism, entertainment industry, driver mindfulness ready framework, legitimate sciences, helpful guide, cerebrum science, situation analysis of social interaction, robot interaction therapy for autistic children, affective computing, depression level analysis etc. [1, 2, 3, 4, 5, 6, 7, 8, 9], task efficiency is enhanced by facial expression recognition.

Since Darwin’s work in 1872, facial expression analysis has been an active and interesting research area for cognitive psychologists and behavioral scientists [8]. Facial expressions are the changes in facial features in response to internal mental states, social experiences or intentions of individuals. Literature on psychology suggests that isolating facial features like eyes, mouth, nose etc. is necessary for the human cognitive system to identify a face [11, 12, 13, 14]. The shape of facial features, color, facial hair and texture differ with ethnic background, sex and age. These differences may affect the robustness of facial expression recognition. Also, facial features can be occluded by a beard, spectacles or other faces in a crowd, which makes it more challenging to recognize facial expressions accurately [8]. Hence, analysis of individual facial features or segments can play a central role in recognizing facial expressions correctly.

In this work, Convolution Neural Networks are used to train different models for the chosen facial features: full face, half face (left/right), upper half face, lower half face, eyes, mouth and nose. These facial feature models are compared and analyzed in the context of facial expression recognition. All the models are then integrated using a stacked generalized ensemble method to develop an efficient and improved facial expression recognition system. The proposed Stacked Generalized Convolution Neural Networks Facial Expression Recognition (SGCFER) integrated system achieves better accuracy than the individual models for FER and is shown to be efficient and robust in emotion recognition under various constraints such as partial occlusion, pose, age and illumination variations. The database used for training is the CMU Multi-PIE database [15]. The proposed system uses CNN with RPA compliance and is also configured on a Raspberry Pi [16], so it can be used for HRI and Industry 4.0 applications.

The main contributions of the proposed work are as follows. A new methodology is proposed to recognize facial expressions from facial images using CNN and stacked generalization. Seven different FER models are implemented and analyzed using seven facial features. Based on the analysis, an integrated system, Stacked Generalized Convolution Neural Networks Facial Expression Recognition (SGCFER), is proposed to improve accuracy for the FER task. A detailed cognitive analysis of facial features in the context of facial expression recognition has been carried out using the seven models; in previously reported literature, only the upper/lower parts of the face are analyzed for FER [17, 19].

The rest of the paper is organized as follows. Related work and contributions are reviewed in Section 2. Section 3 describes the proposed seven CNN models for the seven facial features and the architecture of the integrated SGCFER. Section 4 presents results, analysis and discussion. Conclusions and future directions are given in Section 5.

2. Related work

The related work is presented here under two aspects related to the proposed work: Facial features segmentation for facial expression recognition and CNN based facial expression recognition.

2.1 Facial features segmentation for facial expression recognition

Facial features or facial components such as eyes, nose, mouth, chin and forehead are important aspects that define facial expressions. In 1995, Ekman and Friesen proposed the Facial Action Coding System (FACS), also known as Action Units (AUs) [17]. They proposed that combinations of different AUs are used to represent specific facial expressions. In 2006, Pantic used 27 AUs and presented how to achieve automatic detection of AUs and their temporal segments in a face-profile image sequence [18]. In 2008, Kotsia et al. used upper face, lower face and left/right half face segments to analyze the effect of occlusion on the facial expression recognition task using Gabor filter classification [19]. In 2011, Cotter used eye and mouth segmentation for Fusion of Local Sparse Representation Classifiers (FLSRC) and combined both with square block occlusion (size: 50x50) [20]. In 2014, Cheng et al. used Gabor filters and deep neural networks for facial expression recognition with the JAFFE database [21]. They used eyes, mouth, lower face and upper face for the occlusion challenge. Liu et al. proposed a Weber Local Descriptor histogram feature and decision fusion for facial expression recognition by dividing the facial image into many non-overlapping rectangular regions of equal size [22]. Liu et al. used a deep action units graph network (DAUGN) with a segmentation method for small key areas of the face for facial expression recognition [23]. In 2020, Deb et al. used eye and mouth patches with a parallel CNN model to recognize facial expressions [24].

In the literature so far, authors have used some segmented regions of the face for facial expression recognition; however, there is no analysis of the extent to which each facial feature, and the different areas of interest between features, contribute to the facial expression recognition task. Nor is there an analysis of how these segmented areas can be ensembled to obtain more accurate classification under different occlusion conditions. This paper addresses these issues and proposes a system to improve the accuracy.

2.2 CNN based facial expression recognition

In recent years, CNN has been used widely in many image classification tasks, including facial expression recognition. Jung et al. used CNN and Deep Neural Network (DNN) techniques to recognize emotions in real time [25]. Deng et al. proposed a deep network using three convolution layers with a larger number of filters and two FC layers, with the Real-world Affective Face Database (RAF-DB) [26]. Mayya et al. used a transfer learning technique [27]. Yang et al. also used transfer learning, with a pre-trained VGG16 network, to recognize emotions. They used the JAFFE, CK+ and Oulu-CASIA databases to train the top layer of the network [28]. In 2018, Siqueira et al. proposed an ensemble-based CNN semi-supervised learning technique for facial expression recognition [29]. CNN based multiple face emotion recognition was proposed by Saxena et al. in 2019 [30]. This work used the full facial image to extract feature vectors for training and testing. In 2020, Sun et al. proposed a CNN with an attention mechanism for feature extraction from regions of interest (ROIs) [31]. In 2020, Shao et al. used an Edge-aware Feedback CNN (E-FCNN) for facial expression recognition [32]. Li et al. used reinforcement-learning techniques with DCNN for facial expression classification, using an image selector and a rough emotion classifier [33]. In 2021, Saurav et al. proposed a deep integrated CNN model, which consisted of two structurally similar CNN models and their integrated variant, jointly optimized using a joint-optimization technique to predict facial expressions [34].

Table 1
Algorithm for proposed stacked CNN facial expression recognition system SGCFER

Algorithm

Step 1. Image pre-processing {Crop, Resize}
Step 2. Train individual CNN models for the seven facial segments
Step 3. Create Submodellist() and append trained models
Step 4. For each model Mi: Layer.trainable = False
        stacked_input = [model.input for model in submodel]
Step 5. Merge outputs of individual models to stack:
        Merge = Concatenate{CNN1.output, CNN2.output, ..., CNN7.output}
Step 6. Train stacked meta model:
        D1 = Dense(10, activation = ReLu)(Merge)
        D2 = Dense(5, activation = softmax)(D1)
        stacked_model = Model(inputs = stacked_input, outputs = D2)
        stacked_model.compile( )
        stacked_model.fit( )
        stacked_model.predict( )
Step 7. Evaluation on test data

Figure 1.

Block diagram of SGCFER architecture.

In summary, most of the reported work uses a holistic approach in which the full face is considered, and individual facial features are not used explicitly to recognize facial expressions. There is a gap in the literature in understanding how a CNN works on individual facial features and facial regions explicitly. In the proposed work, individual facial features are analyzed, and a CNN based model is implemented for each facial feature and region of interest that contributes to facial expression recognition. The individual models are then combined using a stacked generalized ensemble method to improve accuracy on the FER task when different parts of the face are occluded.

3. Proposed architecture of SGCFER system

Researchers have made notable contributions in facial expression recognition using facial features and CNN and demonstrated that facial features (eyes or mouth etc.) contribute significantly to shape facial expressions [17, 26, 27, 28, 29, 30, 31, 32, 33, 34, 36].

Figure 2.

(a) A few samples from the CMU Multi-PIE database (b) Preprocessing of the raw image to crop and resize to 93x69 (these images are used for the Full-face model) (c) A few samples of the database created for the other six facial segment models, cropped from the CMU Multi-PIE database (after preprocessing of raw images as in 2(b)).

Motivated by these facts, we propose a new method using convolution neural networks and stacked generalization to improve recognition accuracy in images with occlusion constraints. This section describes the sub models used for the seven facial segments and the integrated CNN stacked generalization architecture. The block diagram of the proposed architecture and the algorithm are shown in Fig. 1 and Table 1 respectively.

Table 2

Details of Layered CNN network for individual seven segments

Layers

Input shape: 93x69x3
Convolution2D layer: {Filters (3,3,3): 128; Padding: same; Stride [1 1]; ReLu activation function}
BatchNormalization
Maxpooling: {(2,2); Stride [2 2]}
Convolution2D layer: {Filters (3,3,3): 256; Padding: same; Stride [1 1]; ReLu activation function}
BatchNormalization
Maxpooling: {(2,2); Stride [2 2]}
Convolution2D layer: {Filters (3,3,3): 256; Padding: same; Stride [1 1]; ReLu activation function}
BatchNormalization
Maxpooling: {(2,2); Stride [2 2]}
Flatten
Dense layer: 64
ReLu activation function layer
Dense layer: 5
Softmax function

Figure 3.

CNN Layered Model for seven individual facial segments.

3.1 Image preprocessing

The CMU Multi-PIE database is used for training and testing. Each raw image has a dimension of 3072x2048. In total, 337 subjects posed different facial expressions in four sessions under different pose and illumination conditions, as shown in Fig. 2(a). In pre-processing, the full-face segment is cropped from the raw image to remove background information, as shown in Fig. 2(b), which results in an image dimension of 260x205. Further, as illustrated in Fig. 3, image pre-processing is also used to crop the other six facial segments from the full-face segment. All seven facial segments are resized to 93x69 to reduce the computational requirement; this is the smallest dimension for which the facial expression recognition task would be possible [37].

3.2 CNN based sub model for seven facial segments

To analyze the seven facial segments on common ground, we use the same framework and the same layered CNN model for all seven segments, as illustrated in Fig. 3. Details of the layers are given in Table 2. The proposed CNN layered network consists of three convolution layers for feature extraction. Each convolution layer is followed by a batch normalization layer and a max pooling layer. Non-linearity is added through the Rectified Linear Unit (ReLu) activation function in each convolution layer, which gives the network more flexibility and lets it represent complex functions. Two dense layers of dimensions 64 and 5 respectively are used for classification. The softmax function is used to classify 5 emotions (Anger, Disgust, Happy, Neutral and Surprise). The detailed architecture of the CNN and its formulation are as follows:

Convolution neural networks: Convolution neural networks are series of layers (convolution layer with activation function, pooling layer, dense layer) connected to each other in different ways. Each layer takes a feature map from another layer as input and transforms it into another feature map using a differentiable function.

Convolution layers with activation function: The convolution layer is the first layer of the CNN and takes the image as input. In this layer the input image is convolved with filters or kernels. The kernel matrix is passed over the image, transforms it, and gives an output feature map based on the kernel weights. The output feature map can be determined as follows:

If $M=\left[\begin{array}{ccc}a_{11}&\cdots&a_{1w}\\ \vdots&\ddots&\vdots\\ a_{h1}&\cdots&a_{hw}\end{array}\right]$ and $K=\left[\begin{array}{ccc}k_{11}&\cdots&k_{1l}\\ \vdots&\ddots&\vdots\\ k_{k1}&\cdots&k_{kl}\end{array}\right]$, the kernel filter $K$ is placed over selected pixels of image $M$. Each kernel weight is multiplied with the corresponding image pixel value. Finally, all products are summed to give one element of the output feature map. This operation is performed over the complete image and results in the output feature map matrix $C=\left[\begin{array}{ccc}c_{11}&\cdots&c_{1n}\\ \vdots&\ddots&\vdots\\ c_{m1}&\cdots&c_{mn}\end{array}\right]$, whose size depends on the stride and padding values.

The convolution of image and filters can be expressed mathematically as:

$\displaystyle C\left({m,n}\right)=\left({M\ast K}\right)\left[{m,n}\right]=\sum_{j}\sum_{k}K\left({j,k}\right)M\left[{m-j,n-k}\right]$ (1)

where $C\left({m,n}\right)$ is the output feature map, $M(h,w)$ is the input image and $K(k,l)$ is the kernel. The size of the output matrix is $m\times n$.
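As an illustration of Eq. (1), the following minimal sketch computes the convolution sum directly in plain Python. It assumes a single-channel image, stride 1 and no padding, and keeps only output positions where the kernel overlaps the image completely; the function name `convolve2d_valid` and the toy matrices are hypothetical and are not taken from the paper's implementation.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Discrete 2D convolution of Eq. (1), restricted to full-overlap positions."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for m in range(kh - 1, h):        # rows where the kernel fits entirely
        row = []
        for n in range(kw - 1, w):    # columns where the kernel fits entirely
            total = 0.0
            for j in range(kh):
                for k in range(kw):
                    # K(j, k) * M[m - j, n - k], as in Eq. (1)
                    total += kernel[j][k] * image[m - j][n - k]
            row.append(total)
        output.append(row)
    return output


# Toy 3x3 image and 2x2 kernel (illustrative values only).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = convolve2d_valid(image, kernel)
print(feature_map)  # [[4.0, 4.0], [4.0, 4.0]]
```

Each output element here equals $M[m,n]-M[m-1,n-1]$, which is 4 everywhere for this image. Note that common deep learning libraries typically implement the convolution layer as a cross-correlation (without flipping the kernel); since the kernel weights are learned, the two conventions are equivalent in practice.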

The dimension of the output matrix is calculated using the following equations:

$\displaystyle\left[{h_{o},w_{o},n_{f}}\right]=\left[{h,w,n_{c}}\right]\ast\left[{k,k,n_{c}}\right]=\left[\left\lfloor\frac{h+2p-k}{s}+1\right\rfloor,\left\lfloor\frac{w+2p-k}{s}+1\right\rfloor,n_{f}\right]$ (2)

$\displaystyle p=\frac{k-1}{2}\ \text{for same convolution and}\ p=0\ \text{for valid convolution}$ (3)

where $\left[{h,w}\right]$ is the dimension of the input feature map, $n_{c}$ is the number of channels in the image, and $k$, $p$, $s$ and $n_{f}$ are the filter kernel size, zero padding, stride and number of filters respectively. $\left[{h_{o},w_{o}}\right]$ is the dimension of the output feature map.
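To make Eqs (2) and (3) concrete, the sketch below traces the feature-map dimensions through the three convolution and max pooling stages listed in Table 2, starting from the 93x69 input. It assumes 3x3 kernels with same padding and stride 1 for the convolutions and 2x2 pooling with stride 2 and no padding; the helper names are hypothetical.

```python
from typing import List, Tuple


def output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Eq. (2) for one spatial dimension: floor((size + 2p - k) / s + 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def same_padding(kernel: int) -> int:
    """Eq. (3): p = (k - 1) / 2 for a same convolution (odd k, stride 1)."""
    return (kernel - 1) // 2


def trace_shapes(h: int, w: int, filters: List[int]) -> List[Tuple[int, int, int]]:
    """Return the shape after every convolution and every pooling layer."""
    shapes = []
    for n_f in filters:
        p = same_padding(3)
        # 3x3 same convolution, stride 1: spatial size is preserved.
        h, w = output_size(h, 3, 1, p), output_size(w, 3, 1, p)
        shapes.append((h, w, n_f))
        # 2x2 max pooling, stride 2, no padding: spatial size is roughly halved.
        h, w = output_size(h, 2, 2, 0), output_size(w, 2, 2, 0)
        shapes.append((h, w, n_f))
    return shapes


shapes = trace_shapes(93, 69, [128, 256, 256])
for shape in shapes:
    print(shape)
# (93, 69, 128), (46, 34, 128), (46, 34, 256),
# (23, 17, 256), (23, 17, 256), (11, 8, 256)
```

Under these assumptions the last pooling layer produces an 11x8x256 tensor, so the flattened vector fed to the first dense layer would have 22,528 elements.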

The convolution operation is followed by an activation function $\Psi$. More precisely, at the $l^{th}$ layer, let the input be $a^{\left[{l-1}\right]}$ with dimension $(n_{h}^{\left({l-1}\right)},n_{w}^{\left({l-1}\right)},n_{c}^{\left({l-1}\right)})$, $a^{\left[0\right]}$ being the input image, with padding $p^{\left[l\right]}$ and stride $s^{\left[l\right]}$. The number of filters is $n_{c}^{\left(l\right)}$, where each filter $K^{\left[n\right]}$ has dimension $(f^{\left[l\right]},f^{\left[l\right]},n_{c}^{\left({l-1}\right)})$. The bias of the $n^{th}$ filter of the $l^{th}$ convolution is $b_{n}^{\left(l\right)}$, the activation function is $\Psi^{\left[l\right]}$, and the output is $a^{\left[l\right]}$ with dimension $(n_{h}^{\left(l\right)},n_{w}^{\left(l\right)},n_{c}^{\left(l\right)})$.

We then have, at each spatial position $[x,y]$ of the output:

$\displaystyle C\left({a^{\left[{l-1}\right]},K^{\left[n\right]}}\right)[x,y]=\sum_{j}\sum_{k}K^{\left[n\right]}\left({j,k}\right)a^{\left[{l-1}\right]}\left[{x-j,y-k}\right]+b_{n}^{\left(l\right)},\quad\forall n\in\left\{{1,2,\ldots,n_{c}^{\left(l\right)}}\right\}$ (4)

$\displaystyle\text{and dim}\left(C\left({a^{\left[{l-1}\right]},K^{\left[n\right]}}\right)\right)=\left({n_{h}^{\left(l\right)},n_{w}^{\left(l\right)}}\right)$ (5)

Thus the output of the convolution layer with activation function is:

$\displaystyle a^{\left[l\right]}=\left[\Psi^{\left[l\right]}\left({C\left({a^{\left[{l-1}\right]},K^{\left[1\right]}}\right)}\right),\Psi^{\left[l\right]}\left({C\left({a^{\left[{l-1}\right]},K^{\left[2\right]}}\right)}\right),\ldots,\Psi^{\left[l\right]}\left({C\left({a^{\left[{l-1}\right]},K^{\left[{n_{c}^{\left(l\right)}}\right]}}\right)}\right)\right]$ (6)

$\displaystyle\text{and dim}(a^{\left[l\right]})=\left({n_{h}^{\left(l\right)},n_{w}^{\left(l\right)},n_{c}^{\left(l\right)}}\right)$ (7)

with:

$\displaystyle n_{h/w}^{\left(l\right)}=\begin{cases}\left\lfloor{\frac{n_{h/w}^{\left({l-1}\right)}+2p^{\left[l\right]}-f^{\left[l\right]}}{s^{\left[l\right]}}+1}\right\rfloor,&s>0\\ n_{h/w}^{\left({l-1}\right)}+2p^{\left[l\right]}-f^{\left[l\right]},&s=0\end{cases}$ (8)

The learned parameters at the $l^{th}$ layer are:

$\displaystyle\text{Filters with}\ \left({f^{\left[l\right]}\times f^{\left[l\right]}\times n_{c}^{\left({l-1}\right)}}\right)\times n_{c}^{\left(l\right)}\ \text{parameters and bias with}\ (1\times 1\times 1)\times n_{c}^{\left(l\right)}\ \text{parameters}$ (9)

Pooling layer: A pooling layer is placed after a convolution layer to down-sample the features of its input without changing the number of channels.

If the pooling layer follows the $l^{th}$ convolution layer, the output of the $l^{th}$ layer is the input of the $(l+1)^{th}$ (pooling) layer. Hence, let the input be $a^{\left[l\right]}$ with dimension $(n_{h}^{\left(l\right)},n_{w}^{\left(l\right)},n_{c}^{\left(l\right)})$, $a^{\left[0\right]}$ being the input image, with padding $p^{\left[{l+1}\right]}$, stride $s^{\left[{l+1}\right]}$, pooling filter size $f^{\left[{l+1}\right]}$ and pooling function $\phi^{\left[{l+1}\right]}$. For the output $a^{\left[{l+1}\right]}$ with dimension $(n_{h}^{\left({l+1}\right)},n_{w}^{\left({l+1}\right)},n_{c}^{\left({l+1}\right)}=n_{c}^{\left(l\right)})$ we have:

$\displaystyle a^{\left[{l+1}\right]}=\phi^{\left[{l+1}\right]}\left({a^{\left[l\right]}}\right)$ (10)

$\displaystyle\text{and dim}(a^{\left[{l+1}\right]})=\left({n_{h}^{\left({l+1}\right)},n_{w}^{\left({l+1}\right)},n_{c}^{\left(l\right)}}\right)$ (11)

with

$\displaystyle n_{h/w}^{\left({l+1}\right)}=\begin{cases}\left\lfloor{\frac{n_{h/w}^{\left(l\right)}+2p^{\left[{l+1}\right]}-f^{\left[{l+1}\right]}}{s^{\left[{l+1}\right]}}+1}\right\rfloor,&s>0\\ n_{h/w}^{\left(l\right)}+2p^{\left[{l+1}\right]}-f^{\left[{l+1}\right]},&s=0\end{cases}$ (12)

The learned parameters are zero for pooling layer.
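The pooling operation can be sketched in a few lines of plain Python. The example assumes max pooling over a single channel with a 2x2 window, stride 2 and no padding, as listed in Table 2; the function name `max_pool2d` and the toy feature map are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def max_pool2d(feature_map: Matrix, size: int = 2, stride: int = 2) -> Matrix:
    """Max pooling without padding; output size is floor((n - f) / s) + 1."""
    h, w = len(feature_map), len(feature_map[0])
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Take the maximum over the size x size window anchored at (r*s, c*s).
            window = [feature_map[r * stride + i][c * stride + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [9, 6]]
```

The function only selects existing values, so it has no weights to learn, and the number of channels is unchanged because each channel would be pooled independently.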

Batch Normalization: In the intermediate layers, the internal covariate shift problem arises because the distribution of the activations changes constantly during training. This continuous change slows down the training process, since each layer must adapt to a new distribution at every training step. To address the internal covariate shift problem, batch normalization is used and the inputs of each layer are normalized [38].

If the training phase uses $b$ batches of $N$ samples each, batch normalization at inference can be formulated as:

$\displaystyle\textit{BN}=\frac{\gamma}{\sqrt{\textit{var}_{x}+\epsilon}}x+\left({\beta-\frac{\gamma E_{x}}{\sqrt{\textit{var}_{x}+\epsilon}}}\right)$ (13)

where $E_{x}=\frac{1}{b}\sum_{i=1}^{b}\mu_{B}^{\left(i\right)}$ is the inference mean and $\textit{var}_{x}=\left(\frac{N}{N-1}\right)\frac{1}{b}\sum_{i=1}^{b}\sigma^{2(i)}_{B}$ is the inference variance over $b$ batches with $N$ samples each. $\mu_{B}=\frac{1}{N}\sum_{i=1}^{N}x_{i}$ is the batch mean and $\sigma_{B}^{2}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu_{B})^{2}$ is the batch variance of the layer inputs.

$\displaystyle\gamma\ \text{and}\ \beta\ \text{are learning parameters for the batch normalization layer.}$ (14)
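A short numerical check may help here. The sketch below computes the inference statistics from per-batch means and variances and evaluates batch normalization in the affine form of Eq. (13); it then compares the result with the equivalent standardize-then-scale form $\gamma(x-E_{x})/\sqrt{\textit{var}_{x}+\epsilon}+\beta$. The function names, the toy batches and the value of $\epsilon$ are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def inference_statistics(batches: List[List[float]]) -> Tuple[float, float]:
    """Average the batch means and the (unbiased-corrected) batch variances."""
    n = len(batches[0])  # N samples per batch
    means = [sum(batch) / n for batch in batches]
    variances = [sum((x - m) ** 2 for x in batch) / n
                 for batch, m in zip(batches, means)]
    e_x = sum(means) / len(batches)
    var_x = (n / (n - 1)) * sum(variances) / len(batches)
    return e_x, var_x


def batch_norm_inference(x: float, gamma: float, beta: float,
                         e_x: float, var_x: float, eps: float = 1e-3) -> float:
    """Affine form: scale * x + (beta - scale * E_x)."""
    scale = gamma / math.sqrt(var_x + eps)
    return scale * x + (beta - scale * e_x)


batches = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]  # b = 2 batches, N = 3 samples
e_x, var_x = inference_statistics(batches)

folded = batch_norm_inference(4.0, gamma=2.0, beta=0.5, e_x=e_x, var_x=var_x)
direct = 2.0 * (4.0 - e_x) / math.sqrt(var_x + 1e-3) + 0.5
print(e_x, var_x, folded, direct)
```

For these toy batches the inference mean is 3 and the inference variance is 1 (up to floating-point rounding), and the two forms agree.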

Dense layer: A dense layer consists of a finite number of neurons; it takes a one-dimensional vector as input and returns another one-dimensional vector.

Mathematically, considering the $j^{th}$ node of the $i^{th}$ layer, we have the following equation:

$\displaystyle z_{j}^{\left(i\right)}=\sum_{l=1}^{n_{i-1}}w_{j,l}^{\left[i\right]}a_{l}^{\left[{i-1}\right]}+b_{j}^{\left[i\right]}\to a_{j}^{\left[i\right]}=\Psi^{\left[i\right]}(z_{j}^{\left(i\right)})$ (15)

The input $a^{\left[{i-1}\right]}$ is the result of a convolution or a pooling layer with the dimensions $(n_{h}^{\left({i-1}\right)},n_{w}^{\left({i-1}\right)},n_{c}^{\left({i-1}\right)})$. To plug it into the dense layer, we flatten the tensor to a 1D vector of dimension $(n_{h}^{\left({i-1}\right)}\times n_{w}^{\left({i-1}\right)}\times n_{c}^{\left({i-1}\right)})$, thus:

$\displaystyle n_{i-1}=n_{h}^{\left({i-1}\right)}\times n_{w}^{\left({i-1}\right)}\times n_{c}^{\left({i-1}\right)}$ (16)

The learned parameters at the $l^{th}$ layer are:

$\displaystyle\text{weights}\ w_{j,l}\ \text{with}\ n_{l-1}\times n_{l}\ \text{parameters and bias with}\ n_{l}\ \text{parameters.}$ (17)
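As a worked example of Eqs (9), (16) and (17), the sketch below counts the filter, weight and bias parameters of the layers listed in Table 2. It assumes 3x3 kernels whose depth equals the number of incoming channels (as Eq. (9) requires), same convolutions that preserve the spatial size, and 2x2 pooling with stride 2; batch normalization parameters are left out of the count. The helper names are hypothetical.

```python
def conv_params(kernel: int, in_channels: int, filters: int) -> int:
    """Eq. (9): (f x f x n_c^(l-1)) x n_c^(l) filter weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(n_in: int, n_out: int) -> int:
    """Eq. (17): n_(l-1) x n_l weights plus n_l biases."""
    return n_in * n_out + n_out


def pooled(size: int) -> int:
    """2x2 max pooling with stride 2 and no padding."""
    return (size - 2) // 2 + 1


h, w, channels = 93, 69, 3            # input shape from Table 2
counts = {}
for index, filters in enumerate([128, 256, 256], start=1):
    counts[f"conv{index}"] = conv_params(3, channels, filters)
    channels = filters
    h, w = pooled(h), pooled(w)       # same convolution keeps h, w; pooling shrinks them

flattened = h * w * channels          # Eq. (16)
counts["dense1"] = dense_params(flattened, 64)
counts["dense2"] = dense_params(64, 5)
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name}: {value}")
print("flattened:", flattened, "total:", total)
```

Under these assumptions the flattened vector has 22,528 elements, and the first dense layer holds about 1.44 million of the roughly 2.33 million counted parameters, which indicates where most of the model size would come from.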

Softmax classifier: The softmax function converts the outputs $z$ of the last dense layer into an estimated probability for each class. Mathematically, the softmax function is given as:

$\displaystyle p_{j}=\textit{softmax}\left(z\right)_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K}e^{z_{k}}}\ \text{for}\ j=1,2,\ldots,K,\quad z=\left({z_{1},z_{2},\ldots,z_{K}}\right)\in\mathbb{R}^{K}$ (18)

$K=5$ for proposed facial expression recognition system.
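A minimal sketch of the softmax classifier for $K=5$ follows. The subtraction of the maximum score is a common numerical-stability step and does not change the result; the score values and the ordering of the emotion labels are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order; the paper lists these five emotions but not an index order.
EMOTIONS = ["Anger", "Disgust", "Happy", "Neutral", "Surprise"]

scores = [1.2, -0.3, 2.5, 0.1, 0.4]  # hypothetical outputs of the last dense layer
probabilities = softmax(scores)
predicted = EMOTIONS[probabilities.index(max(probabilities))]
print([round(p, 3) for p in probabilities], predicted)
```

The probabilities sum to one, and the predicted class is the one with the largest score.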

Learning algorithm: In the CNN learning algorithm, the values of the learning parameters (see Eqs (9), (14) and (17)) are calculated to minimize the loss function $J\left(\theta\right)$, which is the objective function. It quantifies the distance between the predicted and actual values over the whole training set. To minimize $J\left(\theta\right)$, the following two steps are used:

Forward propagation: First, the data propagates through the CNN in batches. The loss function is calculated on each batch as the sum of the differences between actual and predicted values.

Back propagation: Backpropagation is used to calculate the gradients of the cost function $L$ with respect to each of the model parameters $\theta$. Stochastic Gradient Descent (SGD) is applied to update the learning parameters and find their best values. The same process is iterated for a number of epochs.

The learning algorithm is formulated as:

First, model parameters are initialized randomly. Then, for each epoch $e=1,2,\ldots,E$ ($E$ is the number of epochs), forward propagation is performed as follows:

$\forall i$, compute the predicted value $\hat{y}^{\theta}_{i}$ of the training sample $(x_{i},y_{i})$ through the CNN and evaluate:

$\displaystyle J\left(\theta\right)=\frac{1}{n}\sum_{i=1}^{n}L\left(\hat{y}^{\theta}_{i},y_{i}\right)$ (19)

where $n$ is the size of the training set, $\theta$ the model parameters and $L$ the cross-entropy cost (*) function, which is given by:

$\displaystyle L\left(\hat{y}^{\theta}_{i},y_{i}\right)=-\sum_{c=1}^{K}y_{i,c}\log\left(\hat{y}^{\theta}_{i,c}\right)$

where $y_{i,c}$ and $\hat{y}^{\theta}_{i,c}$ are the actual and predicted probabilities of class $c$ for sample $i$.

Stochastic Gradient Descent is applied for backpropagation to update the learning parameters:

$\displaystyle\theta:=\textit{SG}\left(\theta\right)$ (20)

(*) The cost function evaluates the distance between the actual and predicted value on a single point.
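The update loop can be illustrated with a deliberately small toy model. In the sketch below the parameters $\theta$ are simply the five class scores fed to the softmax (a bias-only model, not the paper's CNN), there is a single training sample, and plain gradient steps are used without the decay and momentum of Table 4. The learning rate 0.01 and the 200 epochs are taken from Table 4; everything else, including the function names, is an assumption.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target: int) -> float:
    """Cross-entropy for a label that puts all probability on class `target`."""
    return -math.log(probs[target])


def sgd_step(theta: List[float], target: int, lr: float) -> List[float]:
    """One gradient step; d(loss)/d(score_c) = p_c - y_c for softmax + cross-entropy."""
    probs = softmax(theta)
    grad = [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]
    return [t - lr * g for t, g in zip(theta, grad)]


theta = [0.0] * 5        # toy initialization: all five classes equally likely
target = 2               # hypothetical true class index
losses = []
for epoch in range(200):
    losses.append(cross_entropy(softmax(theta), target))  # forward propagation
    theta = sgd_step(theta, target, lr=0.01)               # parameter update

print(round(losses[0], 4), round(losses[-1], 4))
```

The loss starts at $\log 5\approx 1.609$ (uniform prediction over five classes) and decreases as the score of the true class is pushed up.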

3.3 Stacked generalized CNN (SGC) model

After the seven individual models for the facial segments are implemented, they are tested on unseen cross-corpus seven-model datasets. These datasets contain images that the trained models have not seen and that consist of cropped facial segments (see Fig. 2b). Testing the models on cross-corpus datasets shows how well the trained models can be applied under occlusion and hidden-face constraints, and motivates one generalized system that can be used for various FER applications under different constraints. Further, a stacked generalization method is used to combine the individual outputs, which serve as input to train a deep neural network (meta-learner) that improves the accuracy of these models for facial expression recognition under different occlusions. A stacking ensemble is a learning technique in which multiple classification sub models (base models) are combined by a stacking ensemble model (meta-classifier). The base-level models are trained on the training set, and the higher-level meta-model is then trained and tested on the outputs of the base-level models as feature vectors [38]. At level 0, the sub models are embedded in a neural network, which learns and combines the predictions from each sub model (Fig. 4). At level 1, two dense layers with 35- and 5-dimensional vectors are used for the meta-classifier. The softmax function is used for classification in the last dense layer, with the cross-entropy loss function.

Figure 4.

Stacked generalized CNN model.

Table 3

Stacking algorithm

Algorithm

Input: training data $T_{j}:\left({X,Y}\right)=\left\{{x_{i},y_{i}}\right\}_{i=1}^{n}$, $j=1,2,\ldots,7$, training datasets for sub models.
Output: Stacking ensemble model classifier $\tilde{N}$
Step 1: learn sub model level classifiers
    for $i=1$ to $N$ do
        learn $\theta_{i}$ based on $T_{j}$
    end for
Step 2: construct new data set of predictions
    for $i=1$ to $L$ do
        $T_{j_{\theta}}=\left\{{x_{i}^{\prime},y_{i}}\right\}$, where $x_{i}^{\prime}=\{\theta_{1}(x_{i}),\ldots,\theta_{N}(x_{i})\}$
    end for
Step 3: learn stacking ensemble model classifier
    learn $\tilde{N}$ based on $T_{j_{\theta}}$
    return $\tilde{N}$
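Steps 2 and 3 of the stacking algorithm can be sketched as follows. To keep the example self-contained, the base models are two hypothetical, already trained toy functions that return class probabilities (Step 1 is assumed done), and a simple nearest-centroid classifier stands in for the meta-learner; the paper instead uses seven CNN sub models and a dense neural network as meta-learner. All names and data below are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
BaseModel = Callable[[Sequence[float]], Vector]


def build_meta_dataset(base_models: List[BaseModel],
                       samples: List[Vector],
                       labels: List[int]) -> List[Tuple[Vector, int]]:
    """Step 2: replace each x_i by the concatenated base-model predictions x_i'."""
    meta_dataset = []
    for x, y in zip(samples, labels):
        features: Vector = []
        for model in base_models:
            features.extend(model(x))
        meta_dataset.append((features, y))
    return meta_dataset


def fit_centroid_meta_learner(meta_dataset: List[Tuple[Vector, int]]):
    """Step 3 (stand-in): nearest-centroid classifier on the stacked predictions."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for features, y in meta_dataset:
        if y not in sums:
            sums[y] = [0.0] * len(features)
            counts[y] = 0
        sums[y] = [s + f for s, f in zip(sums[y], features)]
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(features: Vector) -> int:
        return min(centroids, key=lambda y: sum(
            (f - c) ** 2 for f, c in zip(features, centroids[y])))

    return predict


# Hypothetical pre-trained base models over 1D inputs with two classes.
def model_a(x: Sequence[float]) -> Vector:
    return [1.0 - x[0], x[0]]          # informative sub model


def model_b(x: Sequence[float]) -> Vector:
    return [0.5, 0.5]                  # uninformative sub model


base_models = [model_a, model_b]
samples = [[0.1], [0.2], [0.8], [0.9]]
labels = [0, 0, 1, 1]

meta_dataset = build_meta_dataset(base_models, samples, labels)
meta_predict = fit_centroid_meta_learner(meta_dataset)


def stacked_predict(x: Vector) -> int:
    features = [p for model in base_models for p in model(x)]
    return meta_predict(features)


print(stacked_predict([0.3]), stacked_predict([0.7]))  # 0 1
```

The meta-learner only ever sees the base-model outputs, so an uninformative sub model (here `model_b`) contributes identical features to every class and does not affect the decision.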

To test robustness and efficiency of this method different samples are used.

The estimated probabilities at the output of each sub model $M_{i}$ are calculated by the softmax function as given below:

$\displaystyle p_{ij}=\textit{softmax}\left(\bar{\sigma}_{i}\right)[j]=\frac{e^{\bar{\sigma}_{i[j]}}}{\sum^{K}_{k=1}e^{\bar{\sigma}_{i[k]}}}\ \text{for}\ j=1,2,\ldots,5$ (21)

where $\bar{\sigma}_{i}$ is the output vector of the last layer of the $i^{th}$ CNN sub model, $\bar{\sigma}_{i[k]}$ is its component corresponding to the $k^{th}$ class, and $p_{ij}$ is the predicted probability of the $j^{th}$ class in the $i^{th}$ sub model.

After combining these models, the predicted probabilities of all the sub models $M_{i}$, which form the input to the first dense layer of the SGCNN, are:

$\displaystyle a=\textit{concate}\left\{p_{ij}\right\}_{i=1,2,\ldots,7;\ j=1,2,\ldots,5}=\textit{concate}\left\{{\textit{softmax}\left(\bar{\sigma}_{i}\right)}[j]\right\},\ \text{with elements}\ a_{l}\ \text{for}\ l=1,2,\ldots,35$ (22)
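The dimension of the concatenated vector can be checked with a short sketch: seven sub models with five class probabilities each give a 35-element input for the first dense layer of the meta-learner. The last-layer outputs used below are made-up numbers, and the variable names are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


NUM_SUB_MODELS = 7   # full face, half face, upper face, lower face, mouth, nose, eye
NUM_CLASSES = 5      # Anger, Disgust, Happy, Neutral, Surprise

# Hypothetical last-layer outputs, one vector of 5 scores per sub model.
sub_model_outputs = [[float((i + j) % NUM_CLASSES) for j in range(NUM_CLASSES)]
                     for i in range(NUM_SUB_MODELS)]

# Concatenate the per-model softmax probabilities into one 1D vector.
meta_input = [p for scores in sub_model_outputs for p in softmax(scores)]

print(len(meta_input))              # 35
print(round(sum(meta_input), 6))    # 7.0
```

Each block of five entries sums to one, so the whole vector sums to seven; this is a quick sanity check that the concatenation kept every sub model's probabilities intact.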

The output of the $j^{th}$ node of the $i^{th}$ dense layer is given by:

$\displaystyle z_{j}^{\left(i\right)}=\sum_{l=1}^{n_{i-1}}w_{j,l}^{\left[i\right]}a_{l}^{\left[{i-1}\right]}+b_{j}^{\left[i\right]}\to a_{j}^{\left[i\right]}=\Psi^{\left[i\right]}(z_{j}^{\left(i\right)})$ (23)

Table 4

Hyper parameter for Proposed CSGFERS

Models	Hyper parameters	Values
Seven facial segment models	Optimizer	SGD
	Learning rate	0.01
	Decay	$10^{-6}$
	Momentum	0.9
	Epoch	200
	Batch size	32
	Step size per epoch	280
SGCNN model	Optimizer	Adam
	Learning rate	0.001
	Epsilon	$10^{-7}$
	Beta 1	0.9
	Beta 2	0.999
	Epoch	200
	Batch size	590
	Step size per epoch	1

Table 5

Recognition accuracy of seven segment models for facial expressions

Dataset	Accuracy
Full Face	98.96%
Half face	98.47%
Upper Face	84.21%
Lower Face	90.84%
Mouth	89.03%
Nose	74.52%
Eye	70.10%

Table 6

Recognition Accuracy of seven segment models for facial expressions with cross dataset (a) Full Face (b) Half Face (c) Upper face (d) Lower Face (e) Mouth (f) Nose (g) Eye

Figure 5.

Training and validation accuracy plot for individual seven segments models (a) Full face (b) Half Face (c) Upper face (d) Lower face (e) Mouth (f) Nose (g) Eye.

The input $a^{\left[{i-1}\right]}$ is the result of the concatenation, a 1D vector ($N=35$), and $\Psi^{\left[i\right]}$ is the activation function (ReLu).

The learned parameters at this layer are: $\displaystyle\text{weights}\ w_{j,l}\ \text{with}\ n_{l-1}\times n_{l}\ \text{parameters and bias with}\ n_{l}\ \text{parameters.}$ (24)

The output of this dense layer is given as input to the last dense layer, a 1D vector ($N=5$). As in the individual sub models, the softmax classifier with the cross-entropy loss function is then used in the learning algorithm to find the optimum learning parameters and the classification (see Eqs (18)–(20)).

The results, analysis and experimentation details are discussed in following section.

Figure 6.

Consolidated plot for Recognition Accuracy of seven segment models for facial expressions with cross dataset (a) Full Face (b) Half Face (c) Upper face (d) Lower Face (e) Mouth (f) Nose (g) Eye.

Figure 7.

Training and validation accuracy plot for seven segments models with cross datasets for full face (a) Full face-Half face (b) Full Face-Upper face (c) Full Face-Lower face (d) Full face-Mouth (e) Full face- Nose (f) Full face-Eye.

Figure 8.

Training and validation accuracy plot for seven segments models with cross datasets for half face (a) Half face-Full face (b) Half face-Upper face (c) Half face-Lower face (d) Half face-Mouth (e) Half face-Nose (f) Half face-Eye.

Figure 9.

Training and validation accuracy plot for seven segments models with cross datasets for upper face (a) Upper Face-Full face (b) Upper face-Half face (c) Upper Face-Lower face (d) Upper Face-Mouth (e) Upper Face-Nose (f) Upper Face-Eye.

Figure 10.

Training and validation accuracy plot for seven segments models with cross datasets for lower face (a) Lower face- Full face (b) Lower face-Half face (c) Lower Face-Upper face (d) Lower face- Mouth (e) Lower face- Nose (f) Lower Face-Eye.

Figure 11.

Training and validation accuracy plot for seven segments models with cross datasets for mouth (a) Mouth-Full face (b) Mouth-Half Face (c) Mouth-Upper face (d) Mouth-Lower face (e) Mouth-Nose (f) Mouth-Eye.

Table 8

Comparison with state of the art

References	Approach & facial segment used	Overall facial expression recognition accuracy
[19]	Gabor filter & upper face, lower face and left/right half face segments	91.60%
[21]	Gabor filter and DNN & eyes, mouth, lower face and upper face	85.71%
[22]	Weber Local Descriptor histogram feature and decision & rectangular segment of equal size of face	94.74%
[23]	DAUGN & small key areas of face	96.67%
[24]	CNN & eye and mouth	92.02%
[25]	CNN and DNN & full face	85.86%
[26]	CNN & full face	68.20%
[27]	transfer learning pre-trained network & full face	98.12%
[28]	transfer learning pre-trained VGG16 network & full face	97%
[29]	ensemble-based CNN semi-supervised learning technique & full face	88.79%
[30]	CNN & full face	95.80%
[31]	CNN& full face	92%
[33]	DCNN& full face	72.35%
[34]	Integrated CNN & full face	98.52%
Proposed CSG model	Stacked Generalized Convolution neural networks & Seven facial segments	99.70%

Figure 12.

Training and validation accuracy plot for seven segment models with cross datasets for nose (a) Nose-Full face (b) Nose-Half face (c) Nose-Upper face (d) Nose-Lower face (e) Nose-Mouth (f) Nose-Eye.

Figure 13.

Training and validation accuracy plot for seven segment models with cross datasets for eye (a) Eye-Full face (b) Eye-Half face (c) Eye-Upper face (d) Eye-Lower face (e) Eye-Mouth (f) Eye-Nose.

Figure 14.

Accuracy-loss plot for CSGFERS (a) Full face (b) Half Face (c) Upper face (d) Lower face (e) Mouth (f) Nose (g) Eye.

Figure 15.

Comparison of CSGFERS with the seven individual model accuracies shown in Table 5 and Table 7 on the seven datasets (a) Full face (b) Half face (c) Upper face (d) Lower face (e) Mouth (f) Nose (g) Eye.

Figure 16.

Facial emotion recognition results (a) offline images (b) Real time.

Figure 17.

Failure cases of Facial emotion recognition offline images and in real time.

4. Results, analysis and discussion

The CMU Multi-PIE database is used for training, validation and testing of the proposed method. A total of 10,590 images are taken from the database, and each is cropped into the seven facial segments, giving 74,130 segment images. Of these, 70,000 images (10,000 per facial segment) are used for training and validation of the seven facial segment CNN models, and 4,130 images (590 per facial segment) are used to test the accuracy of the proposed stacked generalized CNN model, SGCFER, on unseen data. For each individual facial segment model, 9,000 of the 10,000 images (1,800 per emotion class: Neutral, Anger, Happy, Surprise and Disgust) are used for training and 1,000 images (200 per emotion class) are used for validation. The optimum hyperparameter values used to train the seven facial segment models and the SGC model are shown in Table 4. The accuracies achieved by the seven proposed models (Full face, Half face, Upper face, Lower face, Mouth, Nose and Eyes) on their own segment datasets are 98.96%, 98.47%, 84.21%, 90.84%, 89.03%, 74.52% and 70.1% respectively, which are promising results (see Table 5 and Fig. 5). However, when the seven models are tested on cross-corpus datasets, very low accuracy is achieved. These accuracies are reported in Tables 6a–g and consolidated in Fig. 6. The training and validation plots for the seven models with cross-corpus datasets are shown in Figs 7–13.

Analysis of the facial segment results shows that the full face gives the best accuracy for FER: no occlusion is present and all features are available to learn, so there is less confusion in recognizing the facial expression. With half-face occlusion there is a difference of only 0.63%. Since one half of the face is symmetrical to the other and all the features of the full face are present, the accuracy is almost the same. The accuracies achieved with only the lower face segment and with only the upper face segment are about 8% and 14% lower than for the full face, respectively. FER accuracy is therefore higher when the lower half of the face is visible and the upper half is occluded than when the lower half is occluded and only the upper face segment is visible.
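The dataset bookkeeping above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
SEGMENTS = ["Full face", "Half face", "Upper face", "Lower face",
            "Mouth", "Nose", "Eye"]
EMOTIONS = ["Neutral", "Anger", "Happy", "Surprise", "Disgust"]

images_total = 10_590          # images taken from the database
train_val_per_segment = 10_000 # images per segment for training + validation
test_per_segment = 590         # unseen images per segment for the stacked model
train_per_emotion = 1_800      # training images per emotion class, per segment
val_per_emotion = 200          # validation images per emotion class, per segment

# One cropped image per segment for every source image.
segment_images_total = images_total * len(SEGMENTS)

train_val_total = train_val_per_segment * len(SEGMENTS)
test_total = test_per_segment * len(SEGMENTS)

train_per_segment = train_per_emotion * len(EMOTIONS)
val_per_segment = val_per_emotion * len(EMOTIONS)

print(segment_images_total, train_val_total, test_total)
print(train_per_segment, val_per_segment)
```

The totals are consistent: the training/validation and test images together account for all segment images, and the per-class counts add up to the 10,000 images available for each segment model.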

However, FER accuracies are reduced by 11%–19% compared to the full face if only the mouth, nose or eye segments are visible. The results show that the mouth plays a more important role than the nose and eyes in facial expression recognition. The mouth has more influence on the neutral, happy and disgust expressions than the other two facial features, eyes and nose. In the CMU Multi-PIE database, anger and surprise are both expressed with the mouth open, which increases confusion and decreases the recognition rate. Eyes play a major role for these two expressions: they are open in surprise and closed in anger, so the model can learn the difference between them and improve the recognition rate. Building on these observations, the proposed SGCNN model shows an accuracy improvement of up to 30% on cross-corpus datasets compared to the seven individual facial segment models. The accuracies for cross-corpus datasets are reported in Table 7 and the accuracy plot is shown in Fig. 14a–g. A comparison with the seven individual models on the corresponding datasets is plotted in Fig. 15. Snapshots of facial emotion recognition results for offline images and in real time are shown in Fig. 16a–b. The system fails to recognize the emotion correctly in some cases, as shown in Fig. 17.

The proposed method is implemented in Python 3.7 using Keras with TensorFlow as the backend. Benchmarks are run on an Intel i7-7700 CPU @ 3.60 GHz with 16 GB RAM running Windows 7, and on a GeForce GTX 1660 GPU with 4 GB RAM and CUDA 9.0 running Linux. OpenCV is used to resize and crop images to create the datasets for the seven facial segments. The result of the proposed CSG model is compared with existing literature in Table 8, using the overall facial expression recognition accuracy for the full face. Compared to state-of-the-art techniques, the proposed system shows considerable improvement in facial emotion recognition under variations in illumination and head pose.
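To illustrate how the seven segment datasets could be cut from one face image, the sketch below computes a crop box per segment and applies it to an image stored as nested lists. The half-face, upper-face and lower-face boxes follow directly from their names. The percentage bounds for the eye, nose and mouth boxes are hypothetical placeholders, not values from this work, and the choice of the left half for the half-face segment is also an assumption.

```python
def segment_boxes(height, width):
    """Return (top, bottom, left, right) crop boxes for the seven segments.

    Eye, nose and mouth bounds are illustrative percentages only.
    """
    def rows(pct):
        return height * pct // 100

    def cols(pct):
        return width * pct // 100

    return {
        "Full face": (0, height, 0, width),
        "Half face": (0, height, 0, cols(50)),   # left half, by assumption
        "Upper face": (0, rows(50), 0, width),
        "Lower face": (rows(50), height, 0, width),
        "Eye": (rows(20), rows(50), 0, width),              # hypothetical
        "Nose": (rows(35), rows(70), cols(30), cols(70)),   # hypothetical
        "Mouth": (rows(65), rows(90), cols(25), cols(75)),  # hypothetical
    }

def crop(image, box):
    """Crop a row-major image (list of rows) to the given box."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]

# Dummy 100 x 80 grayscale image whose pixel value encodes its position.
image = [[r * 80 + c for c in range(80)] for r in range(100)]
boxes = segment_boxes(100, 80)
mouth_patch = crop(image, boxes["Mouth"])
```

With an image loaded as an array through OpenCV, the same box would be applied with array slicing, `image[top:bottom, left:right]`, before resizing the patch to the network input size.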

5. Conclusion and future work

In this work, the contribution of seven key facial segments to the facial expression recognition task has been analyzed using CNNs. These segments can be used to recognize facial expressions under occlusion constraints. Seven individual datasets are created from the raw facial images of the CMU Multi-PIE database. Facial expression recognition accuracies are reported for each of the seven facial segment models individually. These models were also tested on unseen cross-corpus datasets and the accuracies are reported. It is observed that FER does not perform well under cross-corpus dataset testing. Hence a stacked generalized CNN model, SGCFER, is proposed, in which the seven trained facial segment models are first concatenated as base learners for the stacked generalized ensemble method and their combined output is used as input to train the meta-learner. A deep neural network with two dense layers is used as the meta-learner in the SGC model. Results show up to 30% improvement for cross-corpus datasets. The proposed method can be used for Human Machine Interaction (HMI) applications which involve face occlusion and hidden face constraints.
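The stacking idea summarized above can be sketched as a forward pass: the outputs of the seven base models are concatenated and passed through a meta-learner with two dense layers. This is a simplified illustration under stated assumptions, not the trained SGCFER model. It assumes each base model contributes a five-class probability vector, and the hidden width of 16, the ReLU activation, the random weights and the random base outputs are all hypothetical.

```python
import math
import random

N_MODELS = 7    # seven facial segment base learners
N_CLASSES = 5   # five emotion classes
HIDDEN = 16     # hypothetical width of the first dense layer

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def meta_learner(base_outputs, hidden_layer, output_layer):
    """Concatenate base-model outputs, then apply two dense layers."""
    stacked = [p for probs in base_outputs for p in probs]  # length 35
    hidden = relu(dense(stacked, *hidden_layer))
    return softmax(dense(hidden, *output_layer))

rng = random.Random(0)
hidden_layer = init_layer(HIDDEN, N_MODELS * N_CLASSES, rng)
output_layer = init_layer(N_CLASSES, HIDDEN, rng)

# Stand-in predictions from the seven base models for one face image.
base_outputs = [softmax([rng.gauss(0, 1) for _ in range(N_CLASSES)])
                for _ in range(N_MODELS)]
fused = meta_learner(base_outputs, hidden_layer, output_layer)
```

In training, only the meta-learner weights would be fitted on the base models' predictions, so that it learns which segment models to trust when parts of the face are occluded.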

Future work will involve testing the system in a real-time environment for human-robot interaction. Physiological signals will also be explored and combined with the proposed method to improve facial expression recognition accuracy.

Acknowledgments

The authors would like to thank all the volunteers who participated in the experimentation, and the host organization for providing the CMU Multi-PIE database. We also thank the other researchers who make relevant databases available for such research experiments.

References

1. Happy, Dasgupta, Patnaik, Routray. Automated Alertness and Emotion Detection for Empathic Feedback during e-Learning. IEEE Fifth International Conference on Technology for Education, Kharagpur, 2013, pp. 47-50. doi: 10.1109/T4E.2013.19.

2. Goodfellow, et al. Challenges in Representation Learning: A Report on Three Machine Learning Contests. Workshop Challenges in Representation Learning (ICML 2013), 2013, pp. 1-8. doi: 10.1007/978-3-642-42051-1_16.

3. Picard. Affective Computing: challenges. International Journal of Human-Computer Studies, 2003. doi: 10.1016/S1071-5819(03)00052-1.

4. Fragopanagos, Taylor. Emotion Recognition in Human–Computer Interaction. Neural Networks, Elsevier. 2005; 18(4): 389-405. doi: 10.1016/j.neunet.2005.03.006.

5. Amelsvoort, Joosten, Krahmer, Postma. Using non-verbal cues to (automatically) assess children’s performance difficulties with arithmetic problems. Computers in Human Behavior. 2013; 29(3): 654-664, ISSN 0747-5632. doi: 10.1016/j.chb.2012.10.016.

6. Coco, et al. Study of mechanisms of social interaction stimulation in autism spectrum disorder by assisted humanoid robot. IEEE Transactions on Cognitive and Developmental Systems. Dec. 2018; 10(4): 993-1004. doi: 10.1109/TCDS.2017.2783684.

7. Jan, Meng, Gaus YFBA, Zhang. Artificial intelligent system for automatic depression level analysis through visual and vocal expressions. IEEE Transactions on Cognitive and Developmental Systems. Sept. 2018; 10(3): 668-680. doi: 10.1109/TCDS.2017.2721552.

8. Ioanna-Ourania, George Tsihrintzis. Visual affect recognition. Frontiers in Artificial Intelligence and Applications 214, IOS Press, 2010, ISBN 978-1-60750-596-9, pp. 1-247.

9. Ioanna-Ourania, Efthymios, George Tsihrintzis, Maria Virvou. On assisting a visual-facial affect recognition system with keyboard-stroke pattern information. Knowl. Based Syst. 2010; 23(3): 350-356.

10. Darwin. The expression of emotions in man and animals. John Murray, 1872; reprinted by University of Chicago Press, 1965.

11. Davies, Ellis, Shepherd. Perceiving and remembering faces. Academic Press, 1981.

12. Kumar M.P, Rajagopal. Detecting facial emotions using normalized minimal feature vectors and semi-supervised twin support vector machines classifier. Appl Intell. 2019; 49: 4150-4174. doi: 10.1007/s10489-019-01500-w.

13. Alexandros, Konstantinos, Stavros, Maja. Lips Don’t Lie: A Generalisable and Robust Approach To Face Forgery Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5039-5049.

14. Alexandros, Rodrigo, Stavros, Maja. Leveraging real talking faces via self-supervision for robust forgery detection. arXiv preprint arXiv:2201.07131, 2022.

15. Gross, Matthews, Cohn, Kanade, Baker. Multi-PIE. Proc Int Conf Autom Face Gesture Recognit. 2010; 28(5): 807-813.

16. Warren. Raspberry Pi Hardware Reference (1st ed.). Apress, USA, 2014.

17. Ekman, Friesen. The Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, San Francisco, 1978.

18. Pantic, Patras. Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). April 2006; 36(2): 433-449. doi: 10.1109/TSMCB.2005.859075.

19. Kotsia, et al. An analysis of facial expression recognition under partial facial image occlusion. Image and Vision Computing. 2008; 26(7): 1052-1067.

20. Cotter. Recognition of occluded facial expressions using a Fusion of Localized Sparse Representation Classifiers. IEEE Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop, 2011, pp. 437-442.

21. Cheng, et al. A Deep Structure for Facial Expression Recognition under Partial Occlusion. Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2014, pp. 211-214.

22. Liu, et al. Facial expression recognition under partial occlusion based on Weber Local Descriptor histogram and decision fusion. 33rd Chinese Control Conference, 2014, pp. 4664-4668.

23. Liu, Zhang, Lin, Wang. Facial Expression Recognition via Deep Action Units Graph Network Based on Psychological Mechanism. IEEE Transactions on Cognitive and Developmental Systems. June 2020; 12(2): 311-322. doi: 10.1109/TCDS.2019.2917711.

24. Deb, Choudhury, Sharma, Talukdar, Laskar. Frontal Facial Expression Recognition using Parallel CNN Model. 2020 National Conference on Communications (NCC), Kharagpur, India, 2020, pp. 1-5. doi: 10.1109/NCC48643.2020.9056011.

25. Jung, et al. Development of Deep Learning-based Facial Expression Recognition System. 21st Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), Mokpo, pp. 1-4. doi: 10.1109/FCV.2015.7103729.

26. Deng, Zhang, Guo. DeepEmo: Real-world Facial Expression Analysis via Deep Learning. International Conference on Visual Communications and Image Processing (VCIP), Singapore, pp. 1-4. doi: 10.1109/VCIP.2015.7457876.

27. Mayya, Pai. Automatic Facial Expression Recognition Using DCNN. Procedia Computer Science, vol. 93, pp. 453-461. doi: 10.1016/j.procs.2016.07.233.

28. Yang, Cao, Zhang. Facial expression recognition using weighted mixture deep neural network based on double-channel facial images. IEEE Access, vol. 6, pp. 4630-4640. doi: 10.1109/ACCESS.2017.2784096.

29. Siqueira, Barros, Magg, Wermter. An Ensemble with Shared Representations Based on Convolutional Networks for Continually Learning Facial Expressions. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018, pp. 1563-1568. doi: 10.1109/IROS.2018.8594276.

30. Saxena, Tripathi, Sudarshan TSB. Deep Dive into Faces: Pose & Illumination Invariant Multi-Face Emotion Recognition System. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 2019, pp. 1088-1093. doi: 10.1109/IROS40897.2019.8967874.

31. Sun, Zheng. ROI-Attention Vectorized CNN Model for Static Facial Expression Recognition. IEEE Access, vol. 8, 2020, pp. 7183-7194. doi: 10.1109/ACCESS.2020.2964298.

32. Shao, Cheng. E-FCNN for tiny facial expression recognition. Appl Intell. 2020. doi: 10.1007/s10489-020-01855-5.

33. Deep reinforcement learning for robust emotional classification in facial expression recognition. Knowledge-Based Systems, vol. 204, p. 106172, ISSN 0950-7051. doi: 10.1016/j.knosys.2020.106172.

34. Saurav, Saini, Singh. EmNet: a deep integrated convolutional neural network for facial emotion recognition in the wild. Appl Intell. 2021; 51: 5543-5570. doi: 10.1007/s10489-020-02125-0.

35. Calvo, Gutiérrez-García, Líbano. What makes a smiling face look happy? Visual saliency, distinctiveness, and affect. Psychol Res. 2018; 82(2): 296-309.

36. Tian, Kanade, Cohn. Facial Expression Analysis. In: Handbook of Face Recognition. Springer, New York, NY. doi: 10.1007/0-387-27257-7_12.

37. Ioffe, Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015, pp. 448-456.

38. David. Stacked generalization. Neural Networks. 1992; 5(2): 241-259, ISSN 0893-6080. doi: 10.1016/S0893-6080(05)80023-1.
