Hybrid heuristic mechanism for occlusion aware facial expression recognition scheme using patch based adaptive CNN with attention mechanism

Abstract

In computer vision, recognizing expressions from partially occluded faces is a serious problem. Prior recognition techniques address the issue only under various assumptions. Because the human visual system is proficient at ignoring occlusions and focusing on non-occluded areas, a benchmark-guided branch has been proposed to detect and eliminate the corrupted features from occluded regions. In recent years, deep learning has attained a prominent place in facial expression recognition. Still, recognition precision is affected by occlusion and large skew. In this research work, a deep structure-based occlusion-aware facial expression recognition mechanism is introduced to provide superior recognition results. Firstly, the required images are taken from publicly available online sources and subjected to face extraction. Face extraction is performed with the Viola-Jones method, which extracts the facial patterns from the original images. Secondly, the extracted face features are given to the pattern recognition stage, where the Adaptive CNN with Attention Mechanism (ACNN-AM) is introduced. This mechanism automatically identifies the occluded regions of the face and focuses on the most discriminative un-occluded regions. Moreover, the hidden patterns in occlusion-aware facial expressions are identified with the help of the Hybrid Galactic Swarm Yellow Saddle Goatfish Optimization (HGSYSGO). Finally, the overall effectiveness of the developed occlusion-aware facial expression recognition model is examined through a comparative analysis against existing baseline recognition techniques.

Keywords

Facial expression recognition; facial images; Viola-Jones method; high detection rate; adaptive convolutional neural networks with attention mechanism; hybrid galactic swarm yellow saddle goatfish optimization

1. Introduction

In social communication, facial expression plays a significant role in day-to-day life [1]. Facial expression recognition has gained attention because of its wide applications, which include video conferencing, health care, virtual reality, cognitive science and driver safety. Facial expression is a major channel of human emotional communication [2]. Facial expressions account for 55% of what is communicated, whereas voice and language account for 38% and 7%, respectively. Automatic facial expression recognition with computer vision and artificial intelligence has been applied in diverse areas such as human-computer interaction, sign language recognition and human behavior recognition [3]. Facial expressions are divided into categories such as anger, fear, disgust, happiness, surprise and sadness [4]. The Facial Action Coding System (FACS) was developed to encode the movements of the facial muscles, in which Action Units (AUs) describe the facial movement [5]. Most of the developed models are built on these action units [6]. In addition, the Facial Animation Parameters (FAPs) model was developed to obtain features for facial expression recognition.

Human emotions can be recognized from body posture, facial expression, gesture and speech [7]. Emotion recognition supports applications such as stress evaluation, marketing, medicine and intelligent learning [8]. Through facial expressions, humans communicate their emotional state to an observer in a natural way [9]. Information and intentions are read from the human face through the position and movement of its muscles [10]. Facial expression and emotion recognition have received growing attention in different areas of computer vision, such as affective computing, human-computer interaction and pattern recognition. They play a crucial role in many applications, such as humanoid robots, telecommerce, patient monitoring systems, mood prediction, sentiment analysis, and clinical and crime psychology. Facial expressions arise naturally from the movement of the facial muscles [11]. Therefore, extracting specific features is the key and most demanding component of efficient emotion recognition [12]. Feature extraction becomes more challenging because facial features vary with background noise, illumination variation, pose variation and uneven lighting.

With the popularity of smart devices, face recognition technology based on deep learning and machine learning is growing remarkably [13]. In a short time, facial expression recognition has been widely adopted in driving, medical treatment and human-computer interaction, and it has become a popular research topic in the education and industrial sectors [14]. Facial expression recognition means extracting the particular facial expression from a still image or a video sequence to determine the emotional state of the recognized subject [15]. In conventional methods, a facial expression recognition algorithm extracts features from the available images and uses them to decide among the expression categories [16]. A covariance pooling layer was introduced by Acharya et al. to capture the deformations in regional facial features and the temporal evolution of per-frame features. Although the above methods attained good results, facial expression recognition remains demanding because of the occurrence of partially occluded faces. To handle pose variation, Zhang et al. applied an adversarial autoencoder to face images under several expressions and poses to enhance performance. In these methods, the datasets are recorded in controlled environments, where the facial images are frontal [17]. Under natural and uncontrolled variations, these models perform poorly in recognizing human expressions.

The key contributions of the developed deep learning-based occlusion-aware facial expression recognition model are summarized below.

• To design an efficient occlusion-aware facial expression recognition approach based on a deep learning framework that recognizes expressions from occluded facial images with high accuracy.
• To implement HGSYSGO for optimizing the epochs, learning rate and hidden neuron count of the Adaptive CNN, enhancing the precision and accuracy of the developed occlusion-aware facial expression recognition model.
• To develop an attention-based Adaptive CNN mechanism for recognizing facial expressions, whose parameters are tuned by the developed HGSYSGO so that occluded facial regions are disregarded and the expressions are recognized accurately.
• To validate the performance of the developed model against different conventional facial expression recognition approaches and algorithms.

The rest of the paper is organized as follows: Section 2 reviews the existing facial expression recognition techniques with their advantages and drawbacks. Section 3 presents the architecture of the proposed method and the dataset description. Section 4 describes the designed algorithm and the preprocessing using the Viola-Jones method. Section 5 covers the facial expression recognition stage based on the attention-based Adaptive CNN. Sections 6 and 7 present the results and performance analysis, and the conclusion.

2. Discussion on literature works

2.1 Related works

In 2020, Kim et al. [18] developed a new strategy that takes color, depth and thermal recording videos as input to estimate dimensional emotion states. The network, known as Multi-Modal Recurrent Attention Networks (MRAN), recognizes facial expressions from attention-boosted feature volumes obtained by learning spatiotemporal attention volumes. The depth and thermal sequences were used as guidance priors for the color sequence to focus on emotionally discriminative regions. For multi-modal facial expression recognition, a benchmark called Multi-Modal Arousal-Valence Facial Expression Recognition (MAVFER) was introduced, consisting of color, depth and thermal recording videos with continuous arousal-valence scores. Experiments on color recording datasets, including RECOLA, SEWA and AFEW, and on the multi-modal recording dataset MAVFER showed that the method outperformed other techniques in dimensional facial expression recognition.

In 2017, Ding et al. [19] presented a facial expression recognition system based on a double-layer Local Binary Pattern (LBP) to obtain the peak expression frame from a video. The method reduces detection time through a lower feature dimension. To handle illumination variation in LBP, the Logarithm-Laplace (LL) domain was additionally applied to obtain efficient and robust facial features for detection. They proposed the Feature Pattern of Taylor (FPT), based on Taylor expansion and LBP, to extract robust facial features from the Taylor feature map. Finally, the theorem based on Taylor expansion was used to obtain the facial expressions. The JAFFE, FG-NET and CK+ (Cohn-Kanade) databases were considered in the experiments, and the results showed enhanced performance and satisfactory emotion classification.

In 2021, Arumugam et al. [20] introduced a sub-band wavelet gradient transform with sub-band selection for the efficient recognition of facial emotions. The wavelet features contain both spatial and spectral domain information, which is used to identify human emotions from facial expressions. The gradient transform method was applied to obtain the gradient of each sub-band, and edge computation was performed to improve image quality. The dimensionality of the obtained features was reduced with Pearson kernel principal component analysis. Finally, the facial features were selected and classified by a fuzzy SVM classifier with a Gaussian membership function.

In 2021, He et al. [21] introduced a new technique for facial expression recognition and AU recognition based on the relationships between them, using convolutional networks. A conditional Generative Adversarial Network (GAN) was introduced to separate and extract the expression information via a de-expression learning procedure. After that, a graph convolutional network was applied to represent the dependencies among the AU nodes, and the nodes were embedded by dividing the expression components into patches corresponding to the AU-related regions. Finally, they described the relationships between expressions and AUs and incorporated them into the loss function to guide the model. The results showed that the proposed work outperformed other popular methods.

In 2020, Chikontwe et al. [22] demonstrated the learning of generative and discriminative representations for pose-invariant face recognition, based on a GAN architecture that disentangles identity from pose variation. An iterative warping scheme performed better than a single generator. The features learned by the encoder were pose invariant, and evaluations on several databases showed that the proposed system outperformed other methods. For example, the precision achieved on FERET and CAS-PEAL was high compared with methods that do not use warping. In particular, there were two notable innovations. Firstly, frontal faces were synthesized via an encoder-decoder structure in the generator, with the pose variation provided to the decoder and the discriminator, following a disentangled GAN architecture. Secondly, the generator encoder acted as a spatial transformer network, using geometric warp parameters to synthesize the real image.

In 2022, Liang et al. [23] explored a dual-branch convolution-transformer network that exploits both local and global facial expression details to make FER robust to real-world occlusions and head-pose variation. In the first branch, a CNN was used to obtain local edge details, exploiting the local modeling capacity of CNNs. In the second branch, transformers, originally introduced in natural language processing, were used to obtain a global representation. Next, a local-global feature fusion module was explored to combine the two kinds of features and to model the relationship between them. With this module, the network not only fused the features but also learned diverse features by itself. Experimental results under within-database and cross-database evaluation on facial expression databases showed that the method outperformed conventional methods and performed efficiently over a wide range of conditions.

In 2020, Hu et al. [24] proposed an occlusion detection module based on symmetric SURF to locate the occluded area of a facial image together with its horizontally symmetric area. Under supervision, a mirror-transition face inpainting method was introduced to inpaint the face quickly. In addition, a recognition network based on heterogeneous soft partitioning was introduced to identify facial expressions. The weights of each part after partitioning were taken as input during training, and the weighted inputs were finally fed to the neural network to identify the expression. The analysis revealed that this method achieved higher precision than state-of-the-art methods on the FER2013 and Cohn-Kanade (CK+) datasets. Apart from that, the proposed technique ran faster than other advanced methods.

In 2019, Li et al. [25] proposed a CNN with Attention mechanism (ACNN) that perceives the occluded regions of the face and focuses on the most discriminative un-occluded regions. ACNN combines multiple representations from facial Regions of Interest (ROIs). Each representation is weighted by a proposed gate unit, which computes an adaptive weight from the region itself. Considering different ROIs, two versions of ACNN were employed, namely patch-based ACNN (pACNN) and global-local-based ACNN (gACNN). pACNN focuses its attention on local facial patches to obtain features. Visualization results showed that ACNN was able to shift its focus from occluded patches to other areas. Under a cross-dataset evaluation protocol with in-the-lab facial expression datasets, ACNN performed well compared with conventional methods.

In 2022, Ye et al. [34] developed the Convolutional Neural Network and Attention Long Short-Term Memory (CNN-ALSTM) for facial expression recognition. The designed method incorporated a two-layer attention mechanism (ACNN-ALSTM). The experiments were conducted on two datasets, FER2013 and CK+. The analysis showed that the developed model performed effectively compared with other baseline approaches.

2.2 Problem statement

Table 1
Existing occlusion aware facial expression recognition models

Author [citation]	Techniques	Benefits	Limitations
Kim et al. [18]	MRAN	• It effectively extracts the spatial information from the collected data. • It has a low cost.	It cannot model the variability of the temporal factors in facial expressions.
Ding et al. [19]	DLBP	• It sensitively captures the different voice, face and biological emotion signals for improving the system’s accuracy. • It has low computational cost.	• It attempts to extract the motion differences from continuous frames. • The illumination of the facial expressions is not effectively handled.
Arumugam et al. [20]	SMSWT	• It directly minimizes the photometric consistency in spatial and temporal patterns. • It provides high robustness and reliability.	• It does not cover the complete range of human emotion variations. • When validated with the performance metrics, it does not provide accurate results, which lowers the reliability of the system model.
He et al. [21]	GCNN	• It is highly capable of solving the local radiance deviation problem. • The convergence rate of the model is high.	• It is cumbersome and computationally very expensive. • The pose variations are not effectively determined.
Chikontwe et al. [22]	GAN	• It helps to obtain a more efficient facial expression feature. • It greatly reduces the time complexity.	This method mainly requires manual selection of subjectively imposed thresholds.
Liang et al. [23]	CT-DBN	It solves the problem of pose differences and data sparsity while considering some uncontrolled conditions.	Automatic facial expression detection models for uncontrolled conditions are needed for this model.
Hu et al. [24]	SURF	• It effectively enables the geometric correction of the input image in the facial feature extraction process. • The accuracy and precision of this model are high.	• It mainly focuses on recognizing the relation among various temporal feature frames. • Several difficulties arise when processing a large amount of data.
Li et al. [25]	ACNN	• The preservation of global and local details is high in this model. • It requires low resources for computation.	It has low generalization performance because of the usage of conditional probabilities.

The recognition of facial emotions is important for designing computer-oriented applications in neuroscience and cognitive science. Occlusion and large skew may reduce the accuracy of a facial recognition scheme. However, variations in pose and facial expression are efficiently identified by several deep structure-based approaches. The features and challenges of these approaches are shown in Table 1. MRAN [18] effectively extracts spatial information from the collected data and has a low cost, but it cannot model the variability of the temporal factors in facial expressions. DLBP [19] sensitively captures different voice, face and biological emotion signals to improve the system’s accuracy, and it has low computational cost. Yet, it attempts to obtain the motion differences from continuous frames, and the illumination of the facial expressions is not effectively handled. SMSWT [20] directly minimizes the photometric consistency in spatial and temporal patterns and provides high robustness and reliability. Nonetheless, it does not cover the complete range of human emotion variations, and its results on the performance metrics are not accurate. GCNN [21] is highly capable of resolving the local radiance fluctuation problem, and its convergence rate is high, but it is cumbersome and computationally very expensive, and the pose variations are not effectively determined. GAN [22] helps to obtain a stronger facial expression feature and greatly reduces the time complexity. However, this method mainly requires manual selection of subjectively imposed thresholds. CT-DBN [23] solves the problem of pose difference and data sparsity while considering some uncontrolled conditions. Nevertheless, automatic facial expression detection models for uncontrolled conditions are needed for this model. SURF [24] effectively enables the geometric correction of the input image in the feature extraction process, and its accuracy and precision are high, but it mainly focuses on recognizing the relation among various temporal feature frames, and several difficulties arise when processing large amounts of data. ACNN [25] preserves global and local details well and requires low computational resources, but it has low generalization performance because of the usage of conditional probabilities. The newly developed deep structure-based strategy is therefore designed to overcome these challenges.

3. Experimented datasets and intelligent framework for occlusion aware facial expression recognition

3.1 Experimented datasets

The data needed for occlusion-aware facial expression recognition are obtained from https://github.com/savya08/Occluded-Facial-Expression-Recognition (access date: 13-12-2022). This source addresses the recognition of facial expressions in occluded images using non-occluded images as auxiliary information. The three standard datasets used here are AffectNet, FED-RO and RAF-DB. The utility files provided with the source are data_loader.py, README.md, models.py, utils.py and trainer.py.

Dataset 1 (AffectNet) is a database created for facial expression detection by collecting and annotating facial images; it is obtained from the above-mentioned online source. It consists of more than 1M facial images collected from the web by querying three search engines with 1260 emotion-related keywords in six languages. About half of the retrieved images were annotated for seven discrete facial expressions and for the intensity of valence and activation. Two baseline deep neural networks are used to classify the images.

Dataset 2 (RAF-DB) is a large database with around 40 k facial images gathered from the above-mentioned online source. Each image is labelled by 40 annotators through crowdsourced annotation. The images in this database vary greatly in the subjects' head poses, lighting conditions, age, gender, occlusions and post-processing operations. The dataset is large and richly annotated. It is divided into a training set and a test set for performance measurement, and the training set is five times larger than the test set.

The input images gathered from the AffectNet and RAF-DB standard datasets are denoted by $\textit{FR}_{T}^{\textit{INP}}$, where $T=1,2,\ldots,n$ and $n$ is the total number of images. Sample images of occluded facial expressions collected from dataset 1 are given in Fig. 1.

Figure 1.

Collected sample facial expressions from the traditional databases.

3.2 Proposed framework

The major limitations of facial expression recognition concern how quickly results are obtained and how well models perform in the wild. Most works are developed under controlled lab conditions and tested on standard datasets, which introduces bias. Lab imagery shows subjects in a frontal or near-frontal head pose under controlled illumination, without self-occlusions, and the image quality is usually high. The behaviour is elicited using video clip stimuli or through human-computer interaction. Both schemes reduce the complexity of the data. The next drawback is the lack of a unified high-level modelling framework for research on human facial expressions. Human behaviour is examined by various methods, among which facial behaviour identification is one of the main aspects. By observing the expressions of the face, we are able to understand a person's emotions. To obtain a complete picture, considering a multi-modal view helps to interpret the message and to enhance the performance of each sub-problem in automatic facial expression analysis. The architecture of the newly implemented occlusion-aware facial expression recognition model is given in Fig. 2.

Figure 2.

Systematic representation of developed deep learning-based occlusion aware facial expression model.

A new deep learning-based occlusion-aware facial expression recognition approach is developed to handle occlusions in images and provide efficient recognition results. It is intended to improve the precision and accuracy rates. The facial images are gathered from the benchmark databases. The facial patterns are extracted from the images using the Viola-Jones method, which outputs a cropped face image. Next, the cropped image is fed to the classification stage, where the attention-based Adaptive CNN identifies the occluded regions and recognizes the expression of the individual. Here, the epochs, learning rate and hidden neuron count are optimized using the proposed HGSYSGO. The objective of this parameter optimization is to attain maximum accuracy and precision. The results of the developed model are compared with different existing occlusion-aware facial expression recognition approaches and heuristic algorithms to confirm its effectiveness.

4. Face pattern extraction and the development of hybrid galactic swarm yellow saddle goatfish optimization for parameter optimization

4.1 Viola Jones algorithm for pattern extraction

The Viola-Jones algorithm is employed because it effectively extracts the relevant patterns from facial images. The Viola-Jones method [26] takes the input image $\textit{FR}_{T}^{\textit{PRE}}$. It classifies facial features by searching for small features and comparing them with each sub-image. The Viola-Jones algorithm is a real-time method that is used here for occlusion-aware facial expression detection, and it detects facial features accurately. The algorithm scans smaller subregions and tries to detect a face by searching for particular features in each subregion. Because the size of the face varies between images, it checks various positions and scales. The features that human faces have in common are captured as Haar features, and the Viola-Jones algorithm searches for particular Haar features in the face. If a feature is detected, the algorithm proceeds to the next step. Here, the image is examined through a rectangular sub-window of 24 $\times$ 24 pixels.

A Haar feature is built from two or three rectangles and is used to decide whether a face is present or absent. Each Haar feature has a particular value: the Haar classifier multiplies the weight of each rectangle by its area and adds the results. The rectangle areas are found efficiently using the concept of the integral image. The value of the integral image $ii$ at any location $(y,z)$ is the sum of the pixels of the image $i$ above and to the left of $(y,z)$, inclusive, as given in Eq. (1).

$\displaystyle ii(y,z)=\sum_{y^{\prime}\leqslant y,\,z^{\prime}\leqslant z}i(y^{\prime},z^{\prime})$ (1)

In the pre-processing phase, the integral image $F$ is computed from the image $I$ using the recurrences in Eqs (2) and (3).

$\displaystyle S(y,z)=S(y,z-1)+I(y,z)$ (2)

$\displaystyle F(y,z)=F(y-1,z)+S(y,z)$ (3)

Here, $S$ is the cumulative sum along $z$, and the terms $S$ and $F$ are initialized by $S(y,-1)=0$ and $F(-1,z)=0$, respectively.
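As an illustration only, the recurrences of Eqs (2) and (3) can be sketched in plain Python. The function name `integral_image` and the small 3 x 3 test image are assumptions made for this example and are not part of the described implementation.

```python
def integral_image(image):
    """Integral image F of a 2-D list `image`, following Eqs (2) and (3)."""
    rows, cols = len(image), len(image[0])
    S = [[0] * cols for _ in range(rows)]  # cumulative sum along z
    F = [[0] * cols for _ in range(rows)]  # integral image
    for y in range(rows):
        for z in range(cols):
            # Eq. (2), using the initial condition S(y, -1) = 0
            S[y][z] = (S[y][z - 1] if z > 0 else 0) + image[y][z]
            # Eq. (3), using the initial condition F(-1, z) = 0
            F[y][z] = (F[y - 1][z] if y > 0 else 0) + S[y][z]
    return F


# Hypothetical 3 x 3 image used only to exercise the recurrences
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
F = integral_image(image)
print(F[2][2])  # sum of all nine pixels
```

Each entry of `F` is filled with one addition per recurrence, so the whole integral image is obtained in a single pass over the pixels.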

In segmentation, the output is $\textit{FT}_{U}^{\textit{IO}}$. The sum of the intensities in a rectangle ranging from $(y,z)$ to $(y1,z1)$ is computed from the values of $F$ at four corner points of the region, instead of summing the intensities of all the pixels, as shown in Eq. (4), where $F$ is taken as 0 whenever an index equals $-1$.

$\displaystyle\sum_{b=y}^{y1}\sum_{c=z}^{z1}I(b,c)=F(y1,z1)-F(y-1,z1)-F(y1,z-1)+F(y-1,z-1)$ (4)
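The four-corner lookup can be checked numerically with a short Python sketch. It assumes inclusive rectangle bounds and treats the integral image as zero outside the image, consistent with the initial conditions $S(y,-1)=0$ and $F(-1,z)=0$; the helper names `integral_image`, `rect_sum` and `brute_force_sum` are hypothetical.

```python
def integral_image(image):
    """Integral image built row by row (running row sum plus the row above)."""
    rows, cols = len(image), len(image[0])
    F = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0
        for z in range(cols):
            row_sum += image[y][z]
            F[y][z] = (F[y - 1][z] if y > 0 else 0) + row_sum
    return F


def rect_sum(F, y, z, y1, z1):
    """Sum of pixels in the inclusive rectangle (y, z)..(y1, z1) from four lookups."""
    def at(r, c):
        # The integral image is taken as 0 for any negative index
        return F[r][c] if r >= 0 and c >= 0 else 0
    return at(y1, z1) - at(y - 1, z1) - at(y1, z - 1) + at(y - 1, z - 1)


def brute_force_sum(image, y, z, y1, z1):
    """Reference value obtained by adding every pixel of the rectangle."""
    return sum(image[b][c] for b in range(y, y1 + 1) for c in range(z, z1 + 1))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
F = integral_image(image)
print(rect_sum(F, 1, 1, 2, 2), brute_force_sum(image, 1, 1, 2, 2))
```

The lookup cost is constant regardless of the rectangle size, which is why the integral image is computed once per image and reused for every Haar feature.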

The Haar feature classifier uses the rectangle integral to compute the value of a feature: the weight of each rectangle is multiplied by the rectangle's area, and the results are summed. The classifiers are organized into stages. The stage comparator sums the classifier results of a stage and compares the sum with a stage threshold, which is obtained from the AdaBoost algorithm. There is no fixed number of Haar features. In the cascade, false candidates are easily identified and eliminated: a sub-window that fails the first stage is rejected immediately, whereas a sub-window that passes proceeds to the next stage, and a face is detected once all stages are passed. The output patterns obtained from the Viola-Jones method are represented as $\textit{ML}_{V}^{\textit{ON}}$. The extracted output patterns are shown in Fig. 3.

Figure 3.

Resultant face detection from the Viola-Jones algorithm.
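The stage-wise rejection described in this subsection can be sketched as follows. This is a simplified illustration, not the detector used in the experiments: the feature layouts, rectangle weights and stage thresholds are hypothetical (in the method above the thresholds come from AdaBoost), and each feature value is taken as the weighted sum of rectangle pixel sums.

```python
def integral_image(image):
    """Integral image built row by row."""
    rows, cols = len(image), len(image[0])
    F = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0
        for z in range(cols):
            row_sum += image[y][z]
            F[y][z] = (F[y - 1][z] if y > 0 else 0) + row_sum
    return F


def rect_sum(F, y, z, y1, z1):
    """Pixel sum of the inclusive rectangle (y, z)..(y1, z1)."""
    def at(r, c):
        return F[r][c] if r >= 0 and c >= 0 else 0
    return at(y1, z1) - at(y - 1, z1) - at(y1, z - 1) + at(y - 1, z - 1)


def haar_value(F, rects):
    """Weighted sum of rectangle sums; rects holds (weight, y, z, y1, z1)."""
    return sum(w * rect_sum(F, y, z, y1, z1) for (w, y, z, y1, z1) in rects)


def passes_cascade(F, stages):
    """Reject the sub-window at the first stage whose sum is below its threshold."""
    for features, stage_threshold in stages:
        stage_sum = sum(haar_value(F, rects) for rects in features)
        if stage_sum < stage_threshold:
            return False  # eliminated early, later stages are never evaluated
    return True


# Hypothetical two-rectangle features on a 4 x 4 sub-window
top_vs_bottom = [(+1, 0, 0, 1, 3), (-1, 2, 0, 3, 3)]
left_vs_right = [(+1, 0, 0, 3, 1), (-1, 0, 2, 3, 3)]
stages = [([top_vs_bottom], 32), ([top_vs_bottom, left_vs_right], 48)]

edge = [[9] * 4, [9] * 4, [1] * 4, [1] * 4]  # bright top half, dark bottom half
flat = [[5] * 4 for _ in range(4)]           # uniform window with no structure
print(passes_cascade(integral_image(edge), stages))
print(passes_cascade(integral_image(flat), stages))
```

Because most sub-windows fail an early stage, the cascade spends most of its computation on the few candidates that resemble a face.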

4.2 Proposed HGSYSGO

The HGSYSGO algorithm is used in the developed occlusion-aware facial expression recognition model to improve its recognition rate by optimizing the parameters of the Adaptive CNN, namely the learning rate, hidden neuron count and epochs. Galactic Swarm Optimization (GSO) provides multiple cycles of exploration and exploitation by splitting the search space. However, it can become trapped in a local optimum, which prevents it from reaching the global optimum. YSGA helps to enhance the optimization results, but its convergence rate is low. To resolve these challenges, HGSYSGO is introduced with a newly modified concept based on the fitness function. In the proposed hybrid algorithm, the random number $r_{1}$ is replaced by a fitness-based value, computed as shown in Eq. (5).

$\displaystyle r_{1}=\frac{\textit{ctfit}-\min\textit{fit}}{\max\textit{fit}-\min\textit{fit}}$ (5)

The variable $\max\textit{fit}$ denotes the maximum fitness value, $\min\textit{fit}$ denotes the minimum fitness value, and $\textit{ctfit}$ represents the current fitness value. If the value of $r_{1}$ exceeds 0.5, the solution is updated using GSO; otherwise, it is updated using YSGA. In the conventional YSGA and GSO, the parameter $r_{1}$ is drawn randomly, which can produce poor solutions. Determining $r_{1}$ with the modified concept gives better optimization results.
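The switching rule can be illustrated with a minimal Python sketch. The function names and the handling of the degenerate case where all fitness values coincide are assumptions of this example; the GSO and YSGA update steps themselves are not shown.

```python
def adaptive_r1(ctfit, minfit, maxfit):
    """Fitness-based r1 of Eq. (5), normalised to the range [0, 1]."""
    if maxfit == minfit:
        # Assumption: when all fitness values coincide, fall back to 0
        return 0.0
    return (ctfit - minfit) / (maxfit - minfit)


def choose_update(ctfit, minfit, maxfit):
    """Select which optimizer updates the current solution."""
    r1 = adaptive_r1(ctfit, minfit, maxfit)
    # r1 above 0.5 selects the GSO update, otherwise the goatfish update
    return "GSO" if r1 > 0.5 else "YSGA"


# Hypothetical fitness values for three candidate solutions
for ctfit in (3.0, 7.0, 10.0):
    print(ctfit, adaptive_r1(ctfit, 2.0, 12.0), choose_update(ctfit, 2.0, 12.0))
```

Since $r_{1}$ is computed from the current fitness rather than drawn at random, the choice between the two update rules is deterministic for a given population.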

YSGA: The YSGA approach treats the hunting area as the search space, and the individuals imitate a group of fish. The algorithm contains two types of search agents, namely chasers and blockers. Within each sub-population, one fish takes the chaser role, and the others act as blockers. Depending on its role, each individual undergoes a different set of operations, which mimic the various behaviours of the natural hunting process.

A population $Q$ of $m$ goatfish $\{q_{1},q_{2},\ldots,q_{m}\}$ is initialized randomly and uniformly distributed within the boundaries $c^{\textit{high}}$ and $c^{\textit{low}}$ of the $n$ -dimensional search space, where $m$ is the size of the population and $q_{j}\in Q$ is the vector of decision variables $q_{j}=\{q_{j}^{1},q_{j}^{2},\ldots,q_{j}^{n}\}$ . The initialization is shown in Eq. (6).

$\displaystyle q_{j}^{k}=\textit{rand}\cdot(c_{k}^{\textit{high}}-c_{k}^{\textit{low}})+c_{k}^{\textit{low}}$ (6)

Here, $j=1,2,\ldots,m$ , $k=1,2,\ldots,n$ , and $\textit{rand}$ represents a random number in [0, 1].
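A short sketch of the random initialization in Eq. (6) follows. It assumes that the bounds are given per dimension as two lists; the function name and the `seed` argument are hypothetical conveniences.

```python
import random


def init_population(m, low, high, seed=None):
    """Eq. (6): draw m goatfish uniformly inside the per-dimension bounds."""
    rng = random.Random(seed)
    n = len(low)
    return [
        [rng.random() * (high[k] - low[k]) + low[k] for k in range(n)]
        for _ in range(m)
    ]
```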

The data set is grouped into $k$ subsets $\{d_{1},d_{2},\ldots,d_{k}\}$ called clusters, and $\mu_{1}$ denotes the mean of a cluster $d_{1}$ . Taking the population of goatfish $Q$ as the search set, the squared error between $\mu_{1}$ and the set of data points $\{q_{1},q_{2},\ldots,q_{i}\}$ in the cluster $d_{1}$ is defined in Eq. (7).

$\displaystyle f(d_{1})=\sum_{q_{h}\in d_{1}}\|q_{h}-\mu_{1}\|^{2}$ (7)

Here, $h=1,2,\ldots,i$ and $m=1,2,\ldots,l$ ; the number of points $i$ is determined in the manner of the $k$ -means algorithm and differs for each cluster $d_{1}$ . The total error over all clusters is given in Eq. (8).

$\displaystyle F(D)=\sum_{m=1}^{l}f(d_{1})$ (8)
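The clustering error of Eqs (7) and (8) can be sketched as below, assuming each cluster is a non-empty list of equal-length points. The function names are illustrative only.

```python
def cluster_error(points):
    """Eq. (7): sum of squared distances from each point to the cluster mean."""
    n = len(points[0])
    mean = [sum(p[k] for p in points) / len(points) for k in range(n)]
    return sum(sum((p[k] - mean[k]) ** 2 for k in range(n)) for p in points)


def total_error(clusters):
    """Eq. (8): add the per-cluster errors over all clusters."""
    return sum(cluster_error(c) for c in clusters)
```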

In a group of goatfish, the chaser fish is denoted as $\Phi_{m}\in P$ . The new position of the chaser fish is formulated in Eq. (9).

$\displaystyle\Phi_{m}^{u+1}=\Phi_{m}^{u}+\alpha\oplus\textit{L\'{e}vy}(\beta)$ (9) $\displaystyle 0<\beta\leqslant 2$

The new and current locations of the chaser fish are $\Phi_{m}^{u+1}$ and $\Phi_{m}^{u}$ , respectively. Next, $\alpha$ is the step size, with $\alpha=1$ . The product $\oplus$ denotes entry-wise multiplication. The parameter $\beta$ is the Lévy index, and its value is calculated from Eq. (10).

$\displaystyle\beta=1.99+\frac{0.001u}{\frac{t_{\max}}{10}}$ (10)

Here, the variable $u$ is the current generation and $t_{\max}$ is the maximum number of iterations. The random step $T$ taken by the chaser fish relative to the best chaser fish is formulated in Eq. (11).

$\displaystyle T=\alpha\oplus\textit{L\'{e}vy}(\beta)\sim\alpha\left(\frac{v}{|w|^{1/\beta}}\right)(\Phi_{m}^{u}-\Phi_{\textit{best}}^{u})$ (11)

Here $T$ denotes the random step and $\Phi_{\textit{best}}^{u}$ represents the best chaser fish among the groups. The variables $v$ and $w$ are drawn from normal distributions, as defined in Eqs (12) and (13) respectively.

$\displaystyle v\sim N(0,\sigma_{v}^{2})$ (12) $\displaystyle w\sim N(0,\sigma_{w}^{2})$ (13)

Here, the standard deviations $\sigma_{v}$ and $\sigma_{w}$ are given in Eq. (14), where $\Gamma$ denotes the gamma function. The positions of the chaser fish and of the best chaser fish are then updated with Eqs (15) and (16).

$\displaystyle\sigma_{v}=\left\{\frac{\Gamma(1+\beta)\sin\frac{\pi\beta}{2}}{\Gamma\left(\frac{1+\beta}{2}\right)\beta 2^{\left(\frac{\beta-1}{2}\right)}}\right\}^{1/\beta},\quad\sigma_{w}=1$ (14) $\displaystyle\Phi_{m}^{u+1}=\Phi_{m}^{u}+T$ (15) $\displaystyle\Phi_{\textit{best}}^{u+1}=\Phi_{\textit{best}}^{u}+T^{\prime}$ (16)

Here $T^{\prime}$ is estimated using Eq. (17).

$\displaystyle T^{\prime}=\alpha\left(\frac{v}{|w|^{1/\beta}}\right)$ (17)
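The Lévy step of Eqs (10) to (17) can be sketched as follows. The sketch takes the function in Eq. (14) to be the gamma function, as in the standard Mantegna form, and draws the two normal samples independently for every dimension; both points are assumptions, as are the function names and the cap that keeps the index within the stated bound of 2.

```python
import math
import random


def levy_index(u, t_max):
    """Eq. (10): Levy index at generation u, capped at 2 as a guard."""
    return min(1.99 + 0.001 * u / (t_max / 10), 2.0)


def sigma_v(beta):
    """Eq. (14): scale of the numerator sample (gamma function assumed)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(position, best, beta, alpha=1.0, rng=random):
    """Eq. (11): entry-wise random step relative to the best chaser."""
    scale = sigma_v(beta)
    step = []
    for p, b in zip(position, best):
        v = rng.gauss(0.0, scale)
        w = rng.gauss(0.0, 1.0)
        while w == 0.0:  # avoid a division by zero in the rare degenerate draw
            w = rng.gauss(0.0, 1.0)
        step.append(alpha * v / abs(w) ** (1 / beta) * (p - b))
    return step
```

Adding the returned step to the current position gives the update of Eq. (15); a chaser that already coincides with the best chaser receives a zero step.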

The strategy of the blocker fish $\Phi_{h}\in Q$ is to surround the corals and block the escape routes of the prey while the chaser fish tries to hunt it. The new location of the blocker fish is denoted in Eq. (18).

$\displaystyle\Phi_{h}^{u+1}=E_{h}\cdot e^{c\rho}\cdot\cos 2\pi\rho+\Phi_{m}$ (18)

Here, $\rho$ is a random number in $[a,1]$ that defines the distance between the blocker and the chaser. The parameter $c$ is a constant that gives the shape and direction of the spiral.

The term $E_{h}$ is the distance between the current location of the blocker fish $\psi_{h}^{u}$ and the chaser fish $\Phi_{m}$ in the cluster $d_{1}$ , as described in Eq. (19).

$\displaystyle E_{h}=|r\cdot\Phi_{m}-\psi_{h}^{u}|$ (19) $\displaystyle\{\Phi_{m},\psi_{h}^{u}\}\in d_{1}$ (20)

Here, $r$ is a random number in [ $-$ 1, 1]. The prey moves within the hunting area during the hunt, so a blocker fish that ends up near the prey becomes the new chaser, and the former chaser fish becomes a blocker. This process is called the exchange of roles.

A change of area is performed for all the goatfish in the cluster, as shown in Eq. (21).

$\displaystyle q_{h}^{u+1}=\frac{\Phi_{\textit{best}}+q_{h}^{u}}{2}$ (21)

Here, $q_{h}^{u+1}$ denotes the new location of the goatfish, $\Phi_{\textit{best}}$ is the best solution among the clusters, and $q_{h}^{u}$ is the current position of the goatfish.
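The blocker move of Eqs (18) and (19) and the change of area of Eq. (21) can be sketched as below. The sketch reads the exponential factor of Eq. (18) as a logarithmic spiral term in the random number and the constant; that reading, the default values of `a` and `c`, and the function names are assumptions.

```python
import math
import random


def blocker_update(blocker, chaser, a=-1.0, c=1.0, rng=random):
    """Eqs (18)-(19): spiral move of a blocker around its chaser."""
    rho = rng.uniform(a, 1.0)   # random number in [a, 1]
    r = rng.uniform(-1.0, 1.0)  # random number in [-1, 1]
    spiral = math.exp(c * rho) * math.cos(2 * math.pi * rho)
    return [abs(r * ch - bl) * spiral + ch for bl, ch in zip(blocker, chaser)]


def change_zone(fish, best):
    """Eq. (21): move a goatfish halfway toward the best solution."""
    return [(b + q) / 2 for q, b in zip(fish, best)]
```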

GSO: The GSO algorithm is based on the movement of stars, galaxies and superclusters of galaxies. Stars are not evenly dispersed in the universe; they are grouped into galaxies, and, on a large enough scale, the huge galaxies appear as point masses. The attraction of stars toward the large masses within a galaxy, and of the galaxy itself toward other great masses, is imitated in the GSO algorithm. In the GSO algorithm, a galaxy of stars is analogous to a subswarm, and a cluster of galaxies is analogous to the superswarm. The cluster of galaxies is located using the Centre of Mass (CM) of the galaxies; likewise, each subswarm is represented by its global best solution. In our strategy, the analogy is limited to clusters of galaxies and galaxies of stars.

In the GSO algorithm, the swarm is a set $Y$ of $E$ -tuples with elements $({Y_{k}^{(j)}\in\Re^{E}})$ , divided into $M$ partitions known as subswarms $Y_{j}$ , each of size $N$ . The elements of $Y$ are initialized randomly within the search space $[\chi_{\min},\chi_{\max}]^{E}$ . The complete swarm framework is defined in Eqs (22)–(25).

$\displaystyle Y_{j}\subset Y:\,j=1,2,\ldots,M$ (22) $\displaystyle Y_{k}^{(j)}\in Y_{j}:\,k=1,2,\ldots,N$ (23) $\displaystyle Y_{j}\cap Y_{k}=\emptyset:\textit{if}\,j\neq k$ (24) $\displaystyle\bigcup_{j=1}^{M}Y_{j}=Y$ (25)

Here, $Y_{j}$ is a subswarm of size $N$ . The velocity and personal best associated with each particle $Y_{k}^{(j)}$ are denoted by $W_{k}^{(j)}$ and $Q_{k}^{(j)}$ , respectively. Each subswarm $Y_{j}$ has an associated global best $h^{(j)}$ , the position with the lowest function value found within that subswarm, toward which its particles are attracted. Subswarms do not share their best solutions with one another: every subswarm is attracted to its own best solution, so the regions they search may overlap. No restriction is placed on the regions searched by different subswarms in the GSO algorithm.

The motion of each subswarm $Y_{i}$ is independent: it has no influence on any other subswarm $Y_{j}$ for $i\neq j$ , which allows an extensive search without interference. To make use of the multiple subswarms, a galactic best $h$ is defined, which is updated whenever any of the global bests $h^{(j)}$ attains a lower function value, $f(h^{(j)})<f(h)$ . Through this update of $h$ , the GSO algorithm retains the best solution found. Compared with a single swarm, multiple swarms produce a combined effect that improves exploration.

Each subswarm explores the search space independently. Each iteration starts with the calculation of velocity and position, and the expressions for the velocity and position updates are given in Eqs (26) and (27) respectively.

$\displaystyle w_{k}^{(j)}\leftarrow x_{1}w_{k}^{(j)}+d_{1}s_{1}(Q_{k}^{(j)}-y_{k}^{(j)})+d_{2}s_{2}(h^{(j)}-y_{k}^{(j)})$ (26) $\displaystyle y_{k}^{(j)}\leftarrow y_{k}^{(j)}+w_{k}^{(j)}$ (27)

Here, $x_{1}$ denotes the inertial weight, $d_{1}$ and $d_{2}$ are acceleration constants, and $s_{1}$ and $s_{2}$ denote random numbers; $x_{1}$ and $s_{j}$ are given in Eqs (28) and (29) respectively.

$\displaystyle x_{1}=1-\frac{l}{M_{1}+1}$ (28) $\displaystyle s_{j}=U(-1,1)$ (29)

Equation (29) denotes that $s_{j}$ is a random number chosen in the range [ $-$ 1, 1]. The global bests of the subswarms take part in the next level of clustering, the formation of the supercluster. By gathering the global bests from the subswarms $Y_{j}$ , a new superswarm $Z$ is formed, as shown in Eqs (30) and (31).

$\displaystyle z^{(j)}\in Z:\,j=1,2,\ldots,M$ (30) $\displaystyle z^{(j)}=h^{(j)}$ (31)

The update equations for the velocity vectors $w^{(j)}$ and position vectors $z^{(j)}$ are given in Eqs (32) and (33) respectively.

$\displaystyle w^{(j)}\leftarrow x_{2}w^{(j)}+d_{3}s_{3}(q^{(j)}-z^{(j)})+d_{4}s_{4}(h-z^{(j)})$ (32) $\displaystyle z^{(j)}\leftarrow z^{(j)}+w^{(j)}$ (33)

Here, $q^{(j)}$ is the personal best associated with the vector $z^{(j)}$ . The relations defining $x_{2}$ , $s_{3}$ and $s_{4}$ are the same as Eqs (28) and (29). Whenever a better location is found, $h$ is updated and serves as the global best of the superswarm. To improve exploitation, the superswarm concentrates the search around the best global points located by the subswarms.
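The two levels of GSO share one update form, which the sketch below illustrates for Eqs (26) to (29) and, with the galactic best as the second attractor, for Eqs (32) and (33). The acceleration constants `d_a` and `d_b` are given hypothetical default values, and the function names are illustrative.

```python
import random


def inertia(l, max_iter):
    """Eq. (28): inertial weight that decays with the iteration counter l."""
    return 1 - l / (max_iter + 1)


def pso_step(y, w, personal_best, attractor, l, max_iter,
             d_a=2.05, d_b=2.05, rng=random):
    """One velocity and position update; d_a and d_b are assumed constants."""
    x = inertia(l, max_iter)
    s_a = rng.uniform(-1.0, 1.0)  # Eq. (29)
    s_b = rng.uniform(-1.0, 1.0)
    new_w = [
        x * wi + d_a * s_a * (pb - yi) + d_b * s_b * (at - yi)
        for yi, wi, pb, at in zip(y, w, personal_best, attractor)
    ]
    new_y = [yi + wi for yi, wi in zip(y, new_w)]
    return new_y, new_w
```

In a subswarm the attractor would be the global best of that subswarm, and in the superswarm it would be the galactic best.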

The superswarm exploits information by reusing the best solutions already computed by the subswarms. Even though the individuals of the superswarm are widely spread compared with those of a subswarm, the superswarm depends on the subswarms for its starting points.

To maintain the diversity of solutions, there is no flow of information from the superswarm back to the subswarms: the subswarms are influenced only by their own global best solutions. Avoiding this feedback keeps the subswarms widely spread and preserves the global search ability of the GSO strategy at every cycle.

When the next cycle starts, the subswarms are not restarted; the search resumes from where it stopped and explores the space again. The GSO algorithm therefore has more chances to find local minima, which matters because any local minimum may eventually turn out to be the global minimum. This continual exploration-exploitation cycle is a likely reason for the good performance of the GSO algorithm.

The pseudocode of the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition method is expressed in Algorithm 1.

Algorithm 1: Recommended HGSYSGO
Assign the parameters of the algorithm
Initialize the population
For $t=1$ to $\textit{MX}_{\textit{IR}}$
Compute the fitness of every solution
For $i=1$ to $\textit{NU}_{\textit{PN}}$
Compute $r_{1}$ using Eq. (5)
If $r_{1}>0.5$
Perform GSO algorithm for solution update
Else
Perform YSGA for solution update
End if
Compute the new position
End for
Update and extract the best solution using fitness computation
End for
Return the best solution
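The control flow of Algorithm 1 can be sketched as a small driver. This is only an outline under stated assumptions: the two update rules are passed in as hypothetical callables `gso_update(solution, best)` and `ysga_update(solution, best)`, the fitness is minimized, and the best solution found so far is always retained.

```python
def hybrid_search(fitness, population, gso_update, ysga_update, max_iter):
    """Outline of Algorithm 1 with the fitness-based switch of Eq. (5)."""
    best = min(population, key=fitness)
    for _ in range(max_iter):
        fits = [fitness(p) for p in population]
        lo, hi = min(fits), max(fits)
        new_population = []
        for sol, fit in zip(population, fits):
            # Eq. (5); 0.5 is an assumed fallback when all fitness values coincide.
            r1 = 0.5 if hi == lo else (fit - lo) / (hi - lo)
            update = gso_update if r1 > 0.5 else ysga_update
            new_population.append(update(sol, best))
        population = new_population
        best = min(population + [best], key=fitness)
    return best
```

Any pair of update rules can be plugged in for a quick experiment, for instance a velocity-based move for `gso_update` and the midpoint rule of Eq. (21) for `ysga_update`.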

5. Enhanced occlusion aware facial expression recognition using adaptive CNN with attention mechanism

5.1 CNN architecture

In the proposed work, an adaptive CNN [27] is used to provide better precision and accuracy in the recognition of occlusion aware facial expressions. The features extracted by the Viola-Jones method are given as the input of the CNN, $\textit{ML}_{V}^{\textit{ON}}$ . The CNN finds the important facial features by itself, without human supervision, and was chosen to obtain high performance on image data. The CNN model has an input layer, convolution layers, pooling layers, a fully connected layer and an output layer. The input image is propagated through multiple convolution and pooling operations, and the corresponding class output is obtained. The whole process includes a) forward propagation and b) back propagation for parameter updates.

a) Forward Propagation Process: Forward propagation passes through the convolution, pooling and fully connected layers. The convolution and pooling layers learn the facial features and their representations. In the convolutional layer, convolution kernels of various sizes and shapes extract local features of the input image. If $y$ is the input feature, the convolution operation can be formulated as in Eq. (34).

$\displaystyle y=f\left(\sum y*x_{jk}+c\right)$ (34)

Here, $c$ represents the additive bias, the term $*$ represents the convolution computation, $x_{jk}$ denotes the convolutional kernel, and $f$ is the activation function. The pooling layer follows the convolutional layer; it decreases the size of the convolutional feature maps and mitigates dimensionality issues. It can be expressed as in Eq. (35).

$\displaystyle y=f(\textit{down}(y)\times x+c)$ (35)

Here, the multiplicative bias is denoted by $x$ , and the pooling function is represented by $\textit{down}(\bullet)$ . The fully connected layer maps the distributed feature representations to the sample label space. It can be expressed as in Eq. (36).

$\displaystyle h(y)=f(xy+c)$ (36)

Here, $h(y)$ and $y$ are the output and input of the fully connected layer. The term $x$ denotes the weight value, and $f(\bullet)$ denotes the Softmax activation function, which is used for classification tasks.

b) Back Propagation for Parameters Update: Next, the Back Propagation (BP) algorithm is applied to reduce the error between the real target value and the model prediction by minimizing the loss function $E_{\textit{loss}}$ through updates of the weights and biases of each network layer. In this work, the cross-entropy loss function is applied, and it can be expressed as in Eq. (37).

$\displaystyle E_{\textit{loss}}=-\frac{1}{n}\sum_{k=1}^{n}\left[z_{k}\ln u_{k}+(1-z_{k})\ln(1-u_{k})\right]$ (37)

Here, the size of the sample is denoted by $n$ , the actual label value is represented by $z_{k}$ , and the predicted output by $u_{k}$ . An optimizer is adopted in order to minimize the loss function $E_{\textit{loss}}$ . The basic model of CNN is given in Fig. 4.

Figure 4.

Basic construction of the CNN model.
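The cross-entropy loss of Eq. (37) can be checked numerically with a few lines. The sketch uses the conventional leading minus sign, so that the loss is non-negative, and clips the predictions away from 0 and 1; the clipping constant `eps` is an assumption added for numerical safety.

```python
import math


def cross_entropy(targets, outputs, eps=1e-12):
    """Eq. (37): mean binary cross-entropy between labels and predictions."""
    total = 0.0
    for z, u in zip(targets, outputs):
        u = min(max(u, eps), 1 - eps)  # keep the logarithms finite
        total += z * math.log(u) + (1 - z) * math.log(1 - u)
    return -total / len(targets)
```

A maximally uncertain prediction of 0.5 for every sample gives a loss of ln 2, and the loss falls as the predictions move toward the true labels.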

5.2 Adaptive CNN architecture

In the developed occlusion aware facial expression recognition model, the Adaptive CNN with parameter optimization is used to increase the precision and accuracy of recognition. The parameters optimized in the CNN by the developed HGSYSGO are the learning rate, the hidden neuron count and the number of epochs. A CNN detects patterns in images to recognize objects, classes, and categories very effectively, but it can suffer from exploding gradients. Hence, the parameters are optimized to increase the performance of facial expression recognition. The objective function of the designed facial recognition model with parameter optimization is given in Eq. (38).

$\displaystyle\textit{FN}=\mathop{\arg\min}\limits_{\{\textit{LE}_{W}^{\textit{CNN}},\textit{HN}_{D}^{\textit{CNN}},\textit{EP}_{C}^{\textit{CNN}}\}}\left(\frac{1}{\textit{AR}+\textit{PN}}\right)$ (38)

Here, the accuracy is denoted by AR, and the precision is denoted by PN. The variable $\textit{LE}_{W}^{\textit{CNN}}$ is the optimized learning rate of the CNN, $\textit{HN}_{D}^{\textit{CNN}}$ denotes the optimized hidden neuron count, and $\textit{EP}_{C}^{\textit{CNN}}$ represents the optimized number of epochs. The learning rate is optimized in the interval [0.01, 0.99], the hidden neuron count in the interval [5, 225], and the number of epochs in the interval [50, 100]. Accuracy is estimated from the positive and negative observations, and its value is given by Eq. (39).

$\displaystyle\textit{AR}=\frac{(\textit{TE}_{\textit{pe}}+\textit{TE}_{\textit{ne}})}{(\textit{TE}_{\textit{pe}}+\textit{TE}_{\textit{ne}}+\textit{FE}_{\textit{pe}}+\textit{FE}_{\textit{ne}})}$ (39)

The true positive, true negative, false positive and false negative values are indicated by the terms $\textit{TE}_{\textit{pe}}$ , $\textit{TE}_{\textit{ne}}$ , $\textit{FE}_{\textit{pe}}$ , and $\textit{FE}_{\textit{ne}}$ , respectively.

Precision is the ratio of correctly detected positive observations to all observations that are detected as positive, as given in Eq. (40).

$\displaystyle\textit{PN}=\frac{\textit{TE}_{\textit{pe}}}{\textit{TE}_{\textit{pe}}+\textit{FE}_{\textit{pe}}}$ (40)

The true positive and false positive values are indicated by the terms $\textit{TE}_{\textit{pe}}$ and $\textit{FE}_{\textit{pe}}$ , respectively.
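The objective of Eq. (38) can be evaluated from confusion-matrix counts as in the following sketch, which uses the standard definitions of accuracy and precision. The function names and the example counts are illustrative assumptions.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (39): share of all observations that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Eq. (40): share of predicted positives that are true positives."""
    return tp / (tp + fp)


def objective(tp, tn, fp, fn):
    """Eq. (38): value minimized by the optimizer, 1 / (AR + PN)."""
    return 1.0 / (accuracy(tp, tn, fp, fn) + precision(tp, fp))
```

Because accuracy and precision both lie in [0, 1], the objective is bounded below by 0.5, and a lower value corresponds to a better parameter setting.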

5.3 Adaptive CNN with attention mechanism architecture

To enhance the accuracy and precision of occlusion aware facial expression recognition, the CNN model with an attention mechanism is introduced. With this attention mechanism, the hidden patterns are learned effectively. An attention function maps a query and a set of key-value pairs to an output that is a weighted sum of the values. For queries and keys of dimension $e_{k}$ and values of dimension $e_{w}$ , the output matrix of the attention can be represented as in Eq. (41).

$\displaystyle\textit{Attention}(R,L,W)=\textit{softmax}\left(\frac{RL^{T}}{\sqrt{e_{k}}}\right)W$ (41)

Here, $L$ and $W$ denote the key and value matrices, and $R$ represents the query matrix. Multi-head attention is obtained by assembling multiple “Scaled Dot-Product Attention” units. Its output can be expressed as in Eq. (42).

$\displaystyle\textit{Multihead}(R,L,W)=\textit{concat}(\textit{head}1,\ldots,\textit{head}h)X^{\circ}$ (42)

Here, the term $\textit{head}i=\textit{Attention}({RX_{i}^{R},LX_{i}^{L},WX_{i}^{W}})$ , with the projection parameter matrices $X_{i}^{R}\in S^{d_{\textit{model}}\times e_{k}}$ , $X_{i}^{L}\in S^{d_{\textit{model}}\times e_{k}}$ , $X_{i}^{W}\in S^{d_{\textit{model}}\times e_{w}}$ and $X^{\circ}\in S^{he_{w}\times d_{\textit{model}}}$ ; $h$ is the number of parallel attention heads (layers).
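The single-head attention of Eq. (41) can be sketched in plain Python with nested lists standing in for matrices. This is a didactic sketch under the assumption that queries, keys and values are given row by row; a practical model would use a tensor library instead.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(R, L, W):
    """Eq. (41): scaled dot-product attention for queries R, keys L, values W."""
    e_k = len(L[0])
    out = []
    for query in R:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(e_k) for key in L
        ]
        weights = softmax(scores)
        out.append([
            sum(wt * value[c] for wt, value in zip(weights, W))
            for c in range(len(W[0]))
        ])
    return out
```

A zero query scores every key equally, so the corresponding output row is the plain average of the value rows.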

Multi-head attention can outperform single-head attention and makes the network learn quickly. Normalization is introduced to reduce the influence of the data scale on classification precision. The normalization formula can be represented as in Eq. (43).

$\displaystyle z=(z_{\text{max}}-z_{\text{min}})\times(y-y_{\text{min}})/(y_{\text{max}}-y_{\text{min}})+z_{\text{min}}$ (43)

Here, $z_{\text{min}}$ and $z_{\text{max}}$ are the bounds of the target interval [ $-$ 1, 1], and $y_{\text{max}}$ and $y_{\text{min}}$ denote the maximum and minimum values of the input signal. The training and testing data come from the processed vibration signals, with each data segment of 1024 points converted into a 2D grayscale image of size 32 $\times$ 32. The bearing was working under a 2 hp load, and the 2D gray images were obtained by transforming the sampled signals. For facial recognition, a CNN (CNN-E) was designed based on multi-head attention. The convolution stride is (1, 1) in the layers of the CNN model, and the data padding strategy is “same”. Also, “Batch Normalization” is added in the convolutional layer. Fig. 5 shows the structure of the adaptive CNN with attention mechanism, which is used for the recognition of occlusion aware facial expressions.

Figure 5.

Developed occlusion aware facial expression recognition using Adaptive CNN with an attention mechanism.
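The min-max normalization of Eq. (43) can be sketched as follows. The sketch adds the offset of the lower target bound so that the output actually spans the target interval; that offset, the handling of a constant input and the function name are assumptions.

```python
def rescale(values, z_min=-1.0, z_max=1.0):
    """Eq. (43): map the input linearly onto the interval [z_min, z_max]."""
    y_min, y_max = min(values), max(values)
    if y_max == y_min:
        # Assumption: a constant input is mapped to the lower bound.
        return [z_min for _ in values]
    scale = (z_max - z_min) / (y_max - y_min)
    return [scale * (y - y_min) + z_min for y in values]
```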

6. Results and discussions

6.1 Simulation setting

The newly recommended HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition model was implemented in Python, and a comparative analysis against existing facial recognition methods was used to examine the effectiveness of the designed method. The population size was 10, and the chromosome length taken for facial feature recognition was 7. The efficiency of the system was studied through measures such as accuracy, F1-score and precision. The analysis was carried out against techniques including Long Short-Term Memory (LSTM) [28], Random Forest (RF) [29], Deep Belief Network (DBN) [30] and Convolutional Neural Network (CNN) [31], and against algorithms including Deer Hunting Optimization Algorithm (DHOA) [26], Grey Wolf Optimization (GWO) [27], GSO [32] and YSGA [33].

6.2 Evaluation measures

The other evaluation measures used to estimate the effectiveness are given as follows.

The accuracy value is given in Eq. (39).

The precision value is given in Eq. (40).

F1-score: The F1 score is defined in Eq. (44).

$\displaystyle\textit{F1SC}=\frac{2\textit{TE}_{\textit{pe}}}{2\textit{TE}_{\textit{pe}}+\textit{FE}_{\textit{pe}}+\textit{FE}_{\textit{ne}}}$ (44)

Sensitivity: Sensitivity is calculated using Eq. (45).

$\displaystyle\textit{Sy}=\frac{\textit{TE}_{\textit{pe}}}{\textit{TE}_{\textit{pe}}+\textit{FE}_{\textit{ne}}}$ (45)

Specificity: Specificity is estimated by Eq. (46).

$\displaystyle\textit{Sp}=\frac{\textit{TE}_{\textit{ne}}}{\textit{TE}_{\textit{ne}}+\textit{FE}_{\textit{pe}}}$ (46)

FDR: FDR is calculated using Eq. (47).

$\displaystyle\textit{FDR}=\frac{\textit{FE}_{\textit{pe}}}{\textit{FE}_{\textit{pe}}+\textit{TE}_{\textit{pe}}}$ (47)

FPR: FPR is evaluated through Eq. (48).

$\displaystyle\textit{FPR}=\frac{\textit{FE}_{\textit{pe}}}{\textit{FE}_{\textit{pe}}+\textit{TE}_{\textit{ne}}}$ (48)

FNR: It is evaluated using Eq. (49).

$\displaystyle\textit{FNR}=\frac{\textit{FE}_{\textit{ne}}}{\textit{FE}_{\textit{ne}}+\textit{TE}_{\textit{pe}}}$ (49)

MCC: MCC is computed using Eq. (50).

$\displaystyle\textit{MCC}=\frac{\textit{TE}_{\textit{pe}}\times\textit{TE}_{\textit{ne}}-\textit{FE}_{\textit{pe}}\times\textit{FE}_{\textit{ne}}}{\sqrt{(\textit{TE}_{\textit{pe}}+\textit{FE}_{\textit{pe}})(\textit{TE}_{\textit{pe}}+\textit{FE}_{\textit{ne}})(\textit{TE}_{\textit{ne}}+\textit{FE}_{\textit{pe}})(\textit{TE}_{\textit{ne}}+\textit{FE}_{\textit{ne}})}}$ (50)

NPV: It is measured using Eq. (51).

$\displaystyle\textit{NPV}=\frac{\textit{TE}_{\textit{ne}}}{\textit{TE}_{\textit{ne}}+\textit{FE}_{\textit{ne}}}$ (51)
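The measures of Eqs (44) to (51) can be computed together from the four confusion-matrix counts, as in this sketch that uses the standard definitions of each measure. The function name and the dictionary keys are illustrative choices.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return the measures of Eqs (44)-(51) as fractions in [0, 1] (MCC in [-1, 1])."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "F1": 2 * tp / (2 * tp + fp + fn),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "FDR": fp / (fp + tp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "MCC": (tp * tn - fp * fn) / mcc_den,
        "NPV": tn / (tn + fn),
    }
```

Several of these measures are complements of one another: sensitivity and FNR sum to one, as do specificity and FPR, and precision and FDR. The percentage values reported in Tables 2 and 3 follow the same pattern, with each such pair summing to 100.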

Figure 6.

Performance analysis on developed deep learning-based occlusion aware facial expression recognition system regarding “(a) accuracy (b) F1-score and (c) precision”.

Figure 7.

Performance analysis on developed deep learning-based occlusion aware facial expression recognition method via different recognition techniques in terms of (a) accuracy (b) F1-score and (c) precision.

Figure 8.

5-fold analysis on developed deep learning-based occlusion aware facial expression recognition method through different algorithms regarding “(a) accuracy (b) F1-score and (c) precision”

Figure 9.

5-fold analysis on offered deep learning-based occlusion aware facial expression recognition method through various recognition techniques with respect to “(a) accuracy (b) F1-score and (c) precision”.

6.3 Performance validation on dataset 1 using learning percentage

Figures 6 and 7 depict the performance of the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system, evaluated with the performance measures against the above-mentioned heuristic algorithms and recognition techniques. From this analysis, at a learning percentage of 60, the precision of the proposed HGSYSGO-ACNN-AM-based facial expression recognition system is 10.86%, 22.4%, 36.60% and 45.71% higher than that of DHOA, GWO, GSO and YSGA, respectively. Thus, the proposed occlusion aware facial expression recognition system is more effective than the compared systems.

6.4 Performance validation on dataset 1 using 5-fold

Figures 8 and 9 depict the performance comparison of the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system against different algorithms and recognition approaches. From the 5-fold analysis, at fold 3, the developed system attains a precision that is 0.91%, 2.06%, 4.21% and 6.45% higher than that of DHOA, GWO, GSO and YSGA, respectively. The 5-fold analysis on the various algorithms shows that the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system attains a higher precision than the others for all folds. Therefore, the overall effectiveness of the proposed HGSYSGO-ACNN-AM-based facial expression recognition method is confirmed by measures such as accuracy, F1-score and precision.

6.5 Performance validation on dataset 2 using learning percentage

Figures 10 and 11 show the performance of the developed deep learning-based occlusion aware facial expression recognition system, evaluated with the performance measures against the heuristic algorithms and recognition techniques. From the analysis, at a learning percentage of 55, the suggested HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system attains an accuracy that is 6.45%, 0.94%, 8.47% and 10.72% higher than that of DHOA, GWO, GSO and YSGA, respectively. Thus, the proposed system is more efficient than the compared recognition systems, and this improvement is reflected across the metrics.

Figure 10.

Performance analysis on developed deep learning-based occlusion aware facial expression recognition system via various optimization algorithms regarding “(a) accuracy (b) F1-score and (c) precision”.

Figure 11.

Performance analysis on proposed deep learning-based occlusion aware facial expression recognition system via different techniques with respect to “(a) accuracy (b) F1-score and (c) precision”.

6.6 Performance validation on dataset 2 using 5-fold

Figures 12 and 13 show the performance evaluation of the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system against different algorithms and recognition models in terms of the performance metrics. From the 5-fold analysis, at fold 2, the developed method attains a precision that is 4.21%, 3.12%, 2.06% and 0.40% higher than that of DHOA, GWO, GSO and YSGA, respectively. The 5-fold analysis on the various algorithms shows that the developed method reaches a better precision than the others for all folds. Hence, the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system shows improved performance across the metrics.

Figure 12.

5-fold analysis on proposed deep learning-based occlusion aware facial expression recognition method via different optimization algorithms regarding “(a) accuracy (b) F1-score and (c) precision”.

Figure 13.

5-fold analysis on proposed deep learning-based occlusion aware facial recognition method through various techniques with respect to “(a) accuracy (b) F1-score and (c) precision”.

Table 2

Performance evaluation on the developed occlusion aware facial expression recognition system through divergent heuristic algorithms

Dataset 1
Metrics	DHOA-ACNN-AM [26]	GWO-ACNN-AM [27]	GSO-ACNN-AM [32]	YSGA-ACNN-AM [33]	HGSYSGO-ACNN-AM
FDR	46.37224	41.76373	36.26571	29.57198	21.94093
Sensitivity	87.62887	90.20619	91.49485	93.29897	95.36082
NPV	97.69452	98.20331	98.47151	98.81926	99.19715
Precision	53.62776	58.23627	63.73429	70.42802	78.05907
Accuracy	87.40795	89.35935	91.34757	93.44624	95.5081
FPR	12.62887	10.78179	8.676976	6.52921	4.467354
MCC	62.04169	66.95282	71.77571	77.51496	83.79789
Specificity	87.37113	89.21821	91.32302	93.47079	95.53265
F1-Score	66.5362	70.77856	75.13228	80.26608	85.84687
FNR	12.37113	9.793814	8.505155	6.701031	4.639175
Dataset 2
FPR	11.84151	10.03503	9.01627	7.113074	3.531283
Sensitivity	88.16171	89.98646	91.0697	92.84931	96.47302
NPV	97.81092	98.17871	98.39045	98.73321	99.39434
Precision	55.37421	59.91242	62.7343	68.50944	81.99255
Accuracy	88.15895	89.96804	90.99601	92.88155	96.46933
MCC	63.71107	68.15034	70.82016	75.92847	86.97263
Specificity	88.15849	89.96497	90.98373	92.88693	96.46872
F1-Score	68.02318	71.93258	74.29188	78.84363	88.64532
FNR	11.83829	10.01354	8.930299	7.150687	3.526984

Table 3

Performance evaluation on the developed occlusion aware facial expression recognition model through diverse techniques

Dataset 1
Metrics	LSTM [28]	RF [29]	DBN [30]	CNN [31]	HGSYSGO-ACNN-AM
FDR	42.59567	37.54448	31.74905	25.91093	21.94093
Sensitivity	88.91753	90.46392	92.52577	94.3299	95.36082
NPV	97.9669	98.28227	98.6758	99.0099	99.19715
Precision	57.40433	62.45552	68.25095	74.08907	78.05907
Accuracy	88.99116	90.86892	92.78351	94.47717	95.5081
FPR	10.99656	9.063574	7.17354	5.498282	4.467354
MCC	65.68546	70.31413	75.58007	80.58225	83.79789
Specificity	89.00344	90.93643	92.82646	94.50172	95.53265
F1-Score	69.76744	73.89474	78.5558	82.9932	85.84687
FNR	11.08247	9.536082	7.474227	5.670103	4.639175
Dataset 2
FDR	42.42462	37.65757	32.22992	25.58364	18.00745
Sensitivity	89.01928	90.88271	92.66877	94.5451	96.47302
NPV	97.98662	98.35493	98.69843	99.04793	99.39434
Precision	57.57538	62.34243	67.77008	74.41636	81.99255
Accuracy	89.06073	90.85508	92.6568	94.57734	96.46933
MCC	65.86854	70.43429	75.30824	80.91793	86.97263
Specificity	89.06764	90.85047	92.6548	94.58272	96.46872
F1-Score	69.92504	73.95456	78.2874	83.28174	88.64532
FNR	10.98072	9.117287	7.331227	5.454897	3.526984

6.7 Determination of performance metrics over different algorithms and techniques

Tables 2 and 3 illustrate the performance of the proposed deep learning-based occlusion aware facial expression recognition method against various optimization algorithms and conventional techniques in terms of the performance metrics. The analysis shows that the developed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition system attained 2.56%, 4.92%, 7.27% and 9.6% improved accuracy over the LSTM, RF, DBN and CNN techniques. The proposed occlusion aware facial expression recognition system is considerably more effective than the conventional systems.

7. Conclusion

A new deep learning-based occlusion aware facial expression recognition method was developed to recognize facial reactions by analyzing the facial features with better precision and accuracy. The suggested system uses the extracted face features to recognize the expression efficiently. Two datasets were taken from occlusion aware facial expression recognition databases. The integral or raw image was given as input to the Viola-Jones method, and the cropped face image was taken as output. Next, the detected face image was fed to the classification stage, where the facial expression was recognized using the ACNN-AM method. The hidden neuron count, the number of epochs and the learning percentage were optimized by the proposed HGSYSGO to increase the performance. Overall, the occlusion aware facial expression recognition system achieved an accuracy improvement of 2.56%, 4.92%, 7.27% and 9.6% over DHOA-ACNN-AM, GWO-ACNN-AM, GSO-ACNN-AM and YSGA-ACNN-AM, respectively. Hence, the overall efficiency of the proposed HGSYSGO-ACNN-AM-based occlusion aware facial expression recognition model is considerably enhanced when compared to other methods.
