Facial expression recognition under occlusion conditions based on multi-feature cross-attention

Abstract

Although facial expression recognition (FER) has a wide range of applications, it may be difficult to achieve under local occlusion conditions which may result in the loss of valuable expression features. This issue has motivated the present study, as a part of which an effective multi-feature cross-attention network (MFCA-Net) is proposed. The MFCA-Net consists of a two-branch network comprising a multi-feature convolution module and a local cross-attention module. Thus, it enables decomposition of facial features into multiple sub-features by the multi-feature convolution module to reduce the impact of local occlusion on facial expression feature extraction. In the next step, the local cross-attention module distinguishes between occluded and unoccluded sub-features and focuses on the latter to facilitate FER. When the MFCA-Net performance is evaluated by applying it to three public large-scale datasets (RAF-DB, FERPlus, and AffectNet), the experimental results confirm its good robustness. Further validation is performed on a real FER dataset with local occlusion of the face.

Keywords

Facial expression recognition; deep convolution; multi-feature convolution module; local cross-attention module

1 Introduction

Facial expressions are the most direct form of communication and expression of human emotions and emotional states (e.g., sadness, fear, anger, surprise, disgust, joy, contempt, neutral, etc.). Thus, extensive research has been conducted in the field of facial expression recognition (FER) with the aim of its adoption in a variety of domains, such as artificial intelligence (AI), human-computer interaction [1], driving fatigue monitoring [2], and knowledge acquisition detection [3]. However, local occlusion during the recognition process can induce noticeable changes in the facial appearance in space, compromising the FER effectiveness. Thus, extensive research effort has been dedicated to overcoming this shortcoming.

Most currently available facial expression feature extraction strategies are based on traditional machine learning (ML) or deep learning (DL) methods. The approaches in the former category mainly rely on hand-crafted features or shallow learning, such as local binary patterns (LBP) [4], LBP on three orthogonal planes (LBP-TOP) [5], sparse learning [6], and histogram of oriented gradients (HOG) [7]. However, these hand-crafted features are often insufficiently robust and accurate under local occlusion conditions. Consequently, to capitalize on the advances in the DL field, various DL network models have been proposed [8–12], as this strategy facilitates focus on the more meaningful areas of expression [13, 19]. In more recent studies, the occlusion problem was addressed by reconstructing the occluded face regions with deep models. For example, Lu et al. [15] reconstructed the occluded facial regions based on the Wasserstein generative adversarial network model to highlight sufficient expression features. However, due to the wide variety of occlusion positions and types, most facial images cannot be reconstructed accurately with this approach. To restore locally occluded images more accurately, Liu et al. [16] proposed an end-to-end network model for local occlusion in low-quality face images. Poux et al. [17] proposed a new auto-encoder with skip connections and applied it to reconstruct the occluded parts of the optical flow. While these methods effectively address occlusion in controlled settings and specific image types, they are difficult to generalize and thus have limited utility in real-world scenarios. Therefore, to alleviate the adverse influence of occlusion on facial expression features, Li et al. [18] proposed the PG-CNN attention model, which selects local region blocks based on facial landmark points and feeds them into an attention network.
Several authors have also adopted patch-based analysis to solve the occlusion issues, and have shown that it is capable of capturing the importance of each relevant facial feature. For this purpose, Wang et al. [19] performed random, fixed position, and landmark-based cropping on relatively large regions, and used the relational attention module and the region bias loss function to refine the weights initially assigned to the facial features. However, as the aforementioned methods rely on facial landmark points for region block selection, their performance is compromised when applied to occluded face images. Thus, to improve the FER performance, researchers are increasingly drawing upon the findings yielded by human psychology studies, which indicate that the facial perception mechanism in the human brain extracts both global and local key information when interpreting emotions. Specifically, in the MA-Net proposed by Zhao et al. [20], the multi-scale module and CBAM module [21] are utilized to extract global and local facial feature information, thus effectively eliminating the interference of occlusion. In this context, attention mechanism is also studied with the goal of obtaining detailed local information. For example, to enhance the discrimination ability, Farzaneh et al. [22] proposed Deep Attention Center Loss (DACL) which adaptively selects some important feature elements. On the other hand, in the mask-based attention parallel network developed by Ju et al. [23], the binary mask extracted from key landmark detection is employed to construct a mask-based attention module. This method locates the region related to expression and embeds it into the parallel network to extract features. The extracted parallel features are subsequently segmented into multiple independent blocks from the spatial dimension, allowing facial expressions to be independently predicted, thus addressing the region occlusion problem. 
However, these methods focus on only a single regional feature of the face, which makes it difficult to obtain enough recognition features.

Although the methods discussed above alleviate many of the occlusion issues, FER in practice is still affected by several problems. The loss of expression feature information due to local occlusion undermines the discrimination ability of deep convolutional models, so FER under local occlusion may require multiple regional features. As the existing methods can use only a single regional feature, this shortcoming has motivated us to propose an effective Multi-Feature Cross-Attention Network (MFCA-Net). This method benefits from a multi-feature convolution module, which reduces the influence of local occlusion on deep convolution. Moreover, the module decomposes deep features into multiple sub-features and extracts rich and robust multi-scale expression recognition features from each sub-feature. In addition, a local cross-attention module is developed to allow the model to focus on multiple salient features simultaneously.

As shown in Figure 1, the MFCA-Net consists of a multi-feature convolution module branch and a local cross-attention module branch. To alleviate the loss of expression feature information, the multi-feature convolution module learns multi-scale features within a single basic block, while the local cross-attention module enables simultaneous focus on multiple salient features. The main contributions of this work are summarized below:

Fig. 1

The pipeline of MFCA-Net.

1) The proposed MFCA-Net employs multi-feature convolution and local cross-attention to effectively address the face occlusion issue in real-world scenarios.

2) As the multi-feature convolution module decomposes the features within a single basic block, it can reduce the impact of occlusion on deep convolution. The channel salient features of SENet attention are introduced to highlight the overall features, obtain rich and robust multi-scale expression recognition features, and alleviate the problem of expression information feature loss.

3) The local cross-attention module is used to attend to the prominent features of multiple local, unoccluded face regions.

4) MFCA-Net outperforms the state-of-the-art methods not only on the FER datasets including RAF-DB, FERPlus, AffectNet-7, and AffectNet-8, but also on the occlusion subsets including Occlusion-RAF-DB, Occlusion-FERPlus, Occlusion-AffectNet, and FED-RO.

2 Related work

As the main aim of this work is improving the FER performance under occlusion conditions, the sections below focus on the existing research on FER under occlusion in real-world scenarios and on attention mechanisms in particular.

2.1 Occlusion FER

In real-world scenarios, the face can easily be occluded by different objects (such as masks, glasses, hands, and scarves), which can be present simultaneously and may appear in random positions. Thus, while the occlusion caused by glasses and hats can be roughly predicted, temporary occlusion caused by occasionally placing a hand in front of one’s face is difficult to model. Currently, facial occlusion is mostly addressed via holistic-based or region-based approaches.

In the holistic-based methods, the face is treated as a whole and relevant features are extracted through deep learning. In this context, to mitigate the impact of occlusion on feature extraction, Zhao et al. [20] proposed MA-Net, which relies on the extraction of both global and local facial features. Lu et al. [15] reconstructed the occluded facial regions based on the Wasserstein generative adversarial network model to highlight sufficient expression features. However, due to the large number of locations and types of occlusion, this method cannot accurately reconstruct facial images, leading to unsatisfactory de-occlusion results. To overcome these challenges, Zhao et al. [24] proposed a robust FER network denoted as EfficientFace, which relies on local feature extractors and channel-spatial modulators that can perceive both local and global facial features. According to their test results on in-the-wild datasets, this strategy exhibits strong robustness under occlusion conditions.

Region-based methods explicitly divide the face into several overlapping or non-overlapping segments. For example, Zhong et al. [25] developed a graph structure representation method where each node on the graph represents the appearance information around a facial landmark, and the edges represent the geometric information encoded by the distance between two nodes. Gong et al. [26] proposed a multi-feature fusion network (MFNet) based on a shallow Gabor convolutional network designed to enhance the adaptability of learned features to orientation and scale changes, and to improve the capacity to capture detailed local features. In addition, Ruan et al. [27] constructed a path selection multiple network model to achieve FER under local facial occlusion scenarios.

2.2 Attention

According to the extant research on human visual perception, the visual gaze can be quickly shifted by the attention mechanism to focus on the target of interest. When presented with occluded facial features, humans typically only focus on the unoccluded local features, and this strategy is increasingly being explored in FER research.

For example, Hu et al. [28] proposed the Squeeze-and-Excitation Networks (SENet) to perform channel-wise feature recalibration, thereby enhancing the expression feature learning potential. Similarly, Woo et al. [21] proposed the Convolutional Block Attention Module (CBAM), which sequentially connects channel and spatial attention to obtain rich attention features. Li et al. [18] addressed the issues of local facial occlusion in real-world FER by adopting robust global-local attention (gA-CNN and PG-CNN) networks. Their approach was shown to improve the overall recognition accuracy by selecting the 24 most relevant points and reweighting each partition using attention. To enhance the recognition of facial expressions in occluded images, Wang et al. [19] proposed the Region Attention Network (RAN) to capture the key regions in face images affected by different degrees and types of occlusion and by pose variation. To address the fact that different facial classes have intrinsic similarities in facial features, and that the difference between facial expressions may be subtle, Wen et al. [29] developed an attention network which recognizes that facial expressions are simultaneously expressed in multiple facial regions, achieving good results.

However, most of the attention methods discussed above focus on a single facial region to enhance the facial feature recognition ability. In contrast, the local cross-attention module proposed in this work simultaneously focuses on the most informative channel features and the most meaningful spatial expression regions, thereby capturing features with strong discriminative power and handling the problem of local occlusion more effectively. Compared with the strategy proposed by other authors [29], the local cross-attention method proposed here is more parsimonious, with 14.38M model parameters and a computational cost of 1.95 GFLOPs.

3 Method

3.1 Overview

To address the loss of expression features caused by local occlusion of the face, MFCA-Net is developed to obtain robust facial features even under occlusion conditions. MFCA-Net comprises the backbone network ResNet18 [30], the multi-feature convolution module, and the local cross-attention module. ResNet18, whose basic structure is depicted in Figure 2a, is used for feature extraction because it keeps the number of model parameters moderate. It also benefits from shortcut connections, which mitigate network degradation as well as vanishing and exploding gradients. Next, the multi-feature convolution module obtains multi-scale features from multiple directions of the face, which can effectively reduce the impact of occlusion on deep convolution. The local cross-attention module then captures and integrates the attention features of different facial expression regions to decrease the influence of non-expressive regions, thereby enhancing the significant expression features situated in the unoccluded local regions of the face. Finally, a 512-dimensional feature vector is obtained from each of the two modules, and the recognition result is obtained through decision fusion by fully-connected layers.

Fig. 2

(a) ResNet18 Basic Block. (b) SENet Block.

3.2 Symmetric multi-feature convolution module

In computer vision tasks, multi-scale features provide fine-grained descriptions of visual images: the feature map is uniformly divided into several independent subsets, and the features extracted from these subsets are then combined. Many currently available networks incorporate multi-scale modules, such as DCN and PyConv, because multi-scale features improve network performance in image classification, facial analysis, and many other domains. In essence, these methods represent multi-scale features in a hierarchical manner.

The multi-feature convolution module designed in this work extracts multi-scale features within a single basic block. In a convolutional neural network, the convolution operation of each layer extracts features of the input feature map within a local receptive field. Deeper convolutions usually have a wider receptive field and extract high-level semantic features, whereas shallower convolutions have narrower receptive fields and extract rich geometric features. A wide receptive field is easily affected by occlusion, and adding shallow geometric features can effectively reduce the impact of occlusion on deep convolution.

Fig. 3

Symmetric multi-feature convolution module.

As shown in Figure 3, which depicts the structure of the multi-feature convolution module adopted in this work, a symmetric multi-feature convolution module is incorporated in the last convolution layer of ResNet18. It consists of regular 3×3 and 1×1 convolutions, a 3×3 convolution with a dilation rate of 2, and SENet. The dilated convolution assists the regular convolution in obtaining a larger receptive field without adding any parameters, which allows features to be extracted from a larger image region. SENet is mainly used to enhance the significant channel features after the aggregation of sub-features. As can be seen from Figure 2b, SENet is a lightweight attention mechanism that focuses only on channels. The Squeeze operation applies global pooling to compress the W×H×C feature map to 1×1×C, thereby obtaining global receptive field information. This is followed by an Excitation operation that assigns a relevance weight to each channel. In the multi-feature convolution module, a feature map X is first obtained by performing a 3×3 convolution. The feature map X is evenly divided into S mapping subsets denoted by X_i, where i ∈ {1, 2, …, S}, so each subset X_i holds 1/S of X and all subsets have the same size. Next, each X_i is processed by a regular 3×3 convolution and a 3×3 convolution with a dilation rate of 2. The outputs of the S subsets are concatenated before SENet is applied to enhance the channel-wise significant multi-scale features. Finally, the two symmetrical modules are concatenated and passed through a 1×1 convolution layer. The output of the multi-feature convolution module can be expressed as:

$Y_{i}^{p_{1}} = \begin{cases} F_{i}^{p_{1}} (f_{1} (X_{i}) + f_{2} (X_{i})), & i = 1 \\ Y_{i}^{p_{1}} + F_{i - 1}^{p_{1}}, & 1 < i \leq S \end{cases}$ (1)

$Y_{i}^{p_{2}} = \begin{cases} F_{i}^{p_{2}} (f_{1} (X_{i}) + f_{2} (X_{i})), & i = S \\ Y_{i}^{p_{2}} + F_{i + 1}^{p_{2}}, & 1 \leq i < S \end{cases}$ (2)

$Y_{i} = f_{1}^{1 \times 1} (Concat (F_{se} (Y_{i}^{p_{1}}), F_{se} (Y_{i}^{p_{2}})))$ (3)

Where p₁ ∈ {up}, p₂ ∈ {down}, f₁(·) denotes the normal 3×3 convolution, f₂(·) represents the 3×3 convolution with a dilation rate of 2, $f_{1}^{1 \times 1}$ (·) is the 1×1 convolution, $F_{i}^{p_{k}}$ (·) is the output of X_i processed by a normal 3×3 convolution and a 3×3 convolution with a dilation rate of 2, $Y_{i}^{p_{k}}$ represents the output of $F_{i}^{p_{k}}$ , where k ∈ {1, 2}, Concat is the feature concatenation, and F_se(·) is the result of SENet processing, and can be expressed as:

$Z = F_{sq} (Y_{i}^{p_{k}}) = \frac{1}{W \times H} \sum_{m = 1}^{W} \sum_{n = 1}^{H} Y_{i}^{p_{k}} (m, n)$ (4)

$S = F_{ex} (Z, W) = \sigma (W_{2} \delta (W_{1} Z))$ (5)

$F_{se} = S \cdot Y_{i}^{p_{k}}$ (6)

In Eqs. (4)-(6), W×H is the spatial dimension, F_sq(·) is the Squeeze operation, F_ex(·) is the Excitation operation, W₁ and W₂ are the weights of the two fully-connected layers used for reducing and increasing dimensionality, δ is the ReLU activation function, and σ is the Sigmoid activation function.
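The Squeeze, Excitation, and rescaling steps can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the implementation used in the experiments: the feature map is assumed to be a nested list of shape C×H×W, the weight matrices `w1` and `w2` are hypothetical toy values, and bias terms are omitted because the Excitation formula contains none.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one mean value per channel (C x H x W -> C)."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def excite(z, w1, w2):
    """Excitation: sigmoid(W2 * relu(W1 * z)), one weight per channel."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]  # ReLU after the reducing layer
    return [1.0 / (1.0 + math.exp(-s)) for s in matvec(w2, hidden)]  # Sigmoid


def se_block(feature_map, w1, w2):
    """Rescale every channel of the feature map by its excitation weight."""
    weights = excite(squeeze(feature_map), w1, w2)
    return [
        [[weight * value for value in row] for row in channel]
        for weight, channel in zip(weights, feature_map)
    ]


# Toy input: 2 channels of size 2x2, reduced to 1 hidden unit and expanded back.
toy_map = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 4.0], [6.0, 8.0]]]
toy_w1 = [[1.0, 0.0]]    # hypothetical reducing weights (2 -> 1)
toy_w2 = [[0.0], [0.0]]  # hypothetical expanding weights (1 -> 2)
rescaled = se_block(toy_map, toy_w1, toy_w2)
```

With the all-zero expanding weights chosen here, every channel weight equals Sigmoid(0) = 0.5, so the block simply halves the input; trained weights would instead emphasize the informative channels.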

In the multi-feature convolution module, S was searched in the range from 1 to 10 with a step size of 1, and S=4 was found to perform best in our setting. A larger S may lose more information and also increases the computational overhead, whereas a smaller S yields insufficient information. To obtain richer features, all $Y_{i}^{p_{k}}$ are connected to achieve information interaction among the mapping subsets X_i. Moreover, the ReLU activation function is employed in the multi-feature convolution module. As the parameters are continuously updated and the data distribution changes during the model learning phase, a BN layer is added to normalize the feature map in the module, which benefits the learning process.
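The even division of X into S mapping subsets can be sketched as follows. The sketch assumes that the split is performed along the channel axis, as in other split-and-aggregate blocks; the text does not state the axis explicitly, so this is an assumption, and the helper name `split_into_subsets` is hypothetical. The 512 channels correspond to the ResNet18 output size reported for the backbone.

```python
def split_into_subsets(channels, num_subsets):
    """Evenly divide a list of channels into num_subsets contiguous groups."""
    if num_subsets < 1 or len(channels) % num_subsets != 0:
        raise ValueError("the number of channels must be divisible by S")
    size = len(channels) // num_subsets
    return [channels[i * size:(i + 1) * size] for i in range(num_subsets)]


# Channel indices of a 512-channel feature map, split with S = 4.
subsets = split_into_subsets(list(range(512)), 4)
subset_sizes = [len(subset) for subset in subsets]
```

Under this assumption, S=4 gives four subsets of 128 channels each, and values of S that do not divide the channel count are rejected.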

The multi-feature convolution module consists of four multi-scale convolution blocks and an SENet module. A larger number (S) of mapping subsets facilitates learning based on significantly richer receptive field sizes, but it may increase the number of model parameters and thus the computational cost. On the other hand, since multi-feature convolution can simultaneously learn deeper semantic features and shallower geometric features, it increases the diversity of facial expression features and can reduce the sensitivity of deep convolution to occlusion. This design is inspired by the observation in [31, 32] that occlusion noise affects the judgment of deep convolution on the target task; a sensitivity analysis tied to the target task is left for future work. To demonstrate the effectiveness of the multi-feature convolution module proposed in this work, class activation maps were visualized through Grad-CAM++ [33], whereby a darker color indicates greater focus of the multi-feature convolution module on the prominent facial region. As shown in Figure 4, compared with the ResNet18 residual structure, the highlighted region is larger, indicating that the multi-feature convolution module can extract more diverse features than ResNet18, allowing the model to obtain a sufficient number of features for effective FER.
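The claim that a dilated convolution enlarges the receptive field without adding parameters can be checked with the standard effective-kernel-size formula k + (k − 1)(d − 1). The short sketch below is illustrative only; the function names are hypothetical, and the 128-channel example is an assumed sub-feature width.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel along each axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights in a 2D convolution (bias excluded).

    The dilation rate does not appear here: it spreads the same weights
    over a wider window instead of adding new ones.
    """
    return kernel_size * kernel_size * in_channels * out_channels


regular_extent = effective_kernel_size(3, 1)  # regular 3x3 convolution
dilated_extent = effective_kernel_size(3, 2)  # 3x3 convolution, dilation rate 2
weights_per_branch = conv_weight_count(3, 128, 128)  # assumed 128-channel subset
```

A 3×3 kernel with a dilation rate of 2 therefore covers a 5×5 window while keeping the nine weights per input-output channel pair of the regular 3×3 kernel.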

Fig. 4

The proposed method was applied to the RAF-DB validation set, and the results were visualized using Grad CAM++ for baseline, multi-feature convolution module, and local cross-attention.

3.3 Local cross-attention module

For FER under occlusion conditions, most existing methods rely on the unoccluded parts of the face. However, in practice, FER under local occlusion may require multiple regional features for expression recognition, whereas the existing methods can only use a single regional feature. To overcome this shortcoming, a local cross-attention module is developed that allows simultaneous focus on multiple local unoccluded facial regions, whereby FER is achieved by enhancing the subtle features that are present in multiple local regions.

As shown in Figure 5, the local cross-attention module is composed of a spatial attention unit and a channel attention unit. The spatial attention unit consists of a 1×1 convolution, a 3×3 convolution, and a ReLU activation function, and its main role is to extract local features. The channel attention unit is composed of a global max pooling (GMP) layer, two linear layers, and a Sigmoid activation function. The GMP takes the maximum of each feature channel output by the spatial attention unit to aggregate the spatial information of the feature map, while the two linear layers form a mini autoencoder that encodes channel information.

Fig. 5

Local cross-attention module.

The spatial attention unit receives the feature map S extracted from ResNet18, S ∈ R^7×7×512, where 7 denotes the spatial size and 512 is the channel size. The channel size of the extracted feature map is then reduced to 256 through a 1×1 convolution, and the spatial features are extracted via a 3×3 convolution. The channel attention unit receives the spatial features as input and outputs the extracted channel features. Finally, a local cross-attention feature vector of size 512 is obtained. These processes are described in more detail below.

Let N_r be the ResNet18 backbone network, where p_r denotes its parameter, and x_i is the input feature vector. Then, the following expression holds:

$X_{i}^{'} = N_{r} (p_{r}, x_{i})$ (7)

Suppose H_j is the j-th spatial attention head and F_sj is its output spatial attention feature, where j indexes the cross-attention heads. Then, the spatial attention of the output can be expressed as follows:

$F_{sj} = X_{i}^{'} \times H_{j} (w_{s}, X_{i}^{'}), \quad j \in \{1, 2, \dots, n\}$ (8)

Where w_s is the network parameter for H_j.

Similarly, suppose $H_{j}^{'}$ represents the j-th channel attention head and F_cj denotes the attention feature of the final output. Then, the cross-attention feature of the output can be expressed as:

$F_{cj} = F_{sj} \times H_{j}^{'} (w_{c}, F_{sj}), \quad j \in \{1, 2, \dots, n\}$ (9)

Where w_c is the network parameter for $H_{j}^{'}$ .
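The channel attention step of Eq. (9) can be sketched in plain Python for a single head. This is a hedged illustration, not the trained module: the feature map is a nested list of shape C×H×W, the weights and biases are hypothetical toy values, and no activation is placed between the two linear layers because the description of the unit lists only GMP, two linear layers, and a Sigmoid.

```python
import math


def global_max_pool(feature_map):
    """Keep the maximum activation of each channel (C x H x W -> C)."""
    return [max(max(row) for row in channel) for channel in feature_map]


def linear(weights, bias, vector):
    """Fully-connected layer: weights @ vector + bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def channel_attention_weights(feature_map, w1, b1, w2, b2):
    """GMP -> linear (encode) -> linear (decode) -> Sigmoid, one weight per channel."""
    pooled = global_max_pool(feature_map)
    code = linear(w1, b1, pooled)   # compress the channel descriptor
    logits = linear(w2, b2, code)   # expand back to one value per channel
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


def apply_channel_attention(feature_map, weights):
    """F_c = F_s x H'(F_s): rescale each channel by its attention weight."""
    return [
        [[a * value for value in row] for row in channel]
        for a, channel in zip(weights, feature_map)
    ]


# Toy spatial-attention output with 2 channels of size 2x2.
spatial_features = [[[1.0, 3.0], [2.0, 0.0]], [[-1.0, -2.0], [-3.0, -4.0]]]
attn = channel_attention_weights(
    spatial_features, [[0.0, 0.0]], [0.0], [[0.0], [0.0]], [0.0, 0.0]
)
cross_features = apply_channel_attention(spatial_features, attn)
```

Running several such heads in parallel, each with its own parameters, corresponds to letting j range over {1, 2, …, n} in Eqs. (8) and (9).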

To demonstrate the effectiveness of the local cross-attention module, Grad-CAM++ was applied, and the findings are shown in Figure 4. It is evident that, compared with the traditional ResNet18, the CAM based on local cross-attention enhances the model’s ability to extract facial expression features under occlusion conditions. For example, when applied to a masked face, the module focuses solely on the multiple regions of the face that are not occluded, whereas for a frontal face image, the module focuses on the multiple prominent regions of the facial components.

3.4 Loss function

(1) Cross-attention fusion loss

As cross-attention can capture multiple facial regions and form multiple attention maps, to avoid overlaps, the attention maps are first scaled using the Log-softmax function to emphasize the most relevant regions. Next, partition loss [29] is used to guide the attention to different regions. Finally, the attention maps are normalized through a BN layer, as described below.

Assuming that Z=a_i ∈ R^n×c, the Log-softmax function can be expressed as follows:

$Log (Softmax (Z)) = \log \left( \frac{\exp (Z_{i})}{\sum_{j = 1}^{c} \exp (Z_{j})} \right)$ (10)

Where Z_i is the i-th vector of the attention map, while Z_j represents its j-th element. Accordingly, the partition loss can be represented as:

$L_{af} = \frac{1}{NC} \sum_{i = 1}^{N} \sum_{j = 1}^{C} \log \left( 1 + \frac{H}{\sigma_{i, j}^{2}} \right)$ (11)

Where N is the number of samples, C represents the attention map channel size, and $\sigma_{i, j}^{2}$ is the variance of the i-th sample within the j-th channel.
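The Log-softmax scaling and the partition loss of Eq. (11) can be written out as a small numerical sketch. Two points are assumptions made only for this illustration: the constant H is treated as a positive scalar passed in as `h` (in related work the corresponding constant is the number of attention heads), and the variances are supplied directly as an N×C nested list rather than computed from attention maps.

```python
import math


def log_softmax(z):
    """Numerically stable log-softmax of one vector."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]


def partition_loss(variances, h):
    """Mean of log(1 + h / variance) over N samples and C channels.

    variances[i][j] is the (strictly positive) variance of sample i in
    channel j; h is the constant H from Eq. (11), assumed here to be a
    positive scalar.
    """
    n = len(variances)
    c = len(variances[0])
    total = sum(math.log(1.0 + h / var) for row in variances for var in row)
    return total / (n * c)


scaled = log_softmax([1.0, 2.0, 3.0])
loss_similar_heads = partition_loss([[0.5, 0.5]], 4)  # low variance
loss_diverse_heads = partition_loss([[4.0, 4.0]], 4)  # high variance
```

The loss decreases as the variance grows, so minimizing it pushes the attention maps apart, which is the stated purpose of guiding the attention to different regions.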

(2) Cross-entropy Loss

Due to the poor complementarity between the two branches of MFCA-Net, decision-level fusion is adopted. Specifically, after global average pooling, the multi-feature convolution module yields a 512-dimensional feature vector, while the cross-attention module normalizes its features through a BN layer, which also results in a 512-dimensional feature vector. Each of the two branches makes predictions through a fully-connected layer, with the Cross-entropy Loss as the loss function. Inspired by [34], the Cross-entropy Loss is used to train MFCA-Net. Its input is the probability distribution output by the model, which gives the predicted probability of each target category. It computes the cross-entropy between the predicted probability distribution and the true probability distribution and outputs a scalar loss value. A smaller loss means that the model’s prediction is closer to the true distribution, while a larger loss means that the prediction deviates more from the true distribution. The Cross-entropy Loss can be formulated as:

$L_{CE} = - \frac{1}{N} \sum_{i = 0}^{N - 1} \log \frac{e^{W_{y_{i}}^{(k)^{T}} v_{i}^{(k)} + b_{y_{i}}^{(k)}}}{\sum_{j = 0}^{C - 1} e^{W_{j}^{(k)^{T}} v_{i}^{(k)} + b_{j}^{(k)}}}$ (12)

Where N is the number of samples, C represents the number of expression categories, $W_{y_{i}}^{(k)}$ is the weight matrix of the fully-connected layer, $b_{y_{i}}^{(k)}$ denotes the bias term of the FC layer, $v_{i}^{(k)}$ is the fully-connected input of the i-th sample, and y_i is its class label.

(3) Combined loss function

The overall loss function is attained by combining the cross-entropy loss and the partition loss to optimize our MFCA-Net, expressed as:

$L = L_{CE} + L_{af}$ (13)
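The cross-entropy term and the combined objective of Eq. (13) can be sketched numerically as follows. The sketch assumes that each row of `logits` already holds the fully-connected outputs for one sample (the W^T v + b terms), that labels are integer class indices, and that the partition loss is supplied as a precomputed scalar; all names are hypothetical.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-probability of the true class over N samples."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]  # -log softmax(row)[label]
    return total / len(labels)


def combined_loss(logits, labels, partition_loss_value):
    """L = L_CE + L_af, with L_af given as a precomputed scalar."""
    return cross_entropy(logits, labels) + partition_loss_value


# Two samples, three classes (toy values).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
loss_correct = cross_entropy(toy_logits, [0, 2])  # labels match the largest logits
loss_wrong = cross_entropy(toy_logits, [1, 0])    # labels contradict the logits
total = combined_loss(toy_logits, [0, 2], 0.25)
```

As described above, the loss is small when the predicted distribution concentrates on the true class and grows when the prediction and the label disagree.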

4 Experimental results

The effectiveness of MFCA-Net presented in the preceding sections was validated by applying it to three public datasets, namely RAF-DB [35], AffectNet [36], and FERPlus [37]. Further tests were performed on four real occlusion datasets, namely Occlusion-RAF-DB, Occlusion-FERPlus, Occlusion-AffectNet, and FED-RO [38]. In addition, ablation experiments were carried out to assess the effectiveness of each module, using benchmark data exemplified in Figure 6.

Fig. 6

Sample images extracted from the RAF-DB (the first row), FERPlus (the second row), and AffectNet (the third row) databases.

For the RAF-DB, AffectNet, and FERPlus datasets, the MFCA-Net model was trained directly on the officially aligned images. All input images were resized to 224×224 pixels before random cropping, horizontal flipping, and erasing were applied, as these data augmentation methods help prevent over-fitting. As ResNet18 served as the baseline network, to achieve fair evaluation, a ResNet18 model pretrained on the MS-Celeb-1M [39] face recognition dataset was used in all experiments. The MFCA-Net model was implemented in PyTorch on a Windows 10 operating system and was trained on a workstation with a Tesla T4 16GB GPU. The MFCA-Net model has 20.87M parameters and requires 2.42 GFLOPs. At inference, MFCA-Net takes 0.0767 s to recognize a single image.

During the model training process, the SGD optimizer with a momentum parameter of 0.9 and a weight decay of 1e-4 was utilized. Model training was conducted on the RAF-DB and FERPlus datasets for 50 epochs, with the batch sizes of 256 and 128, and the initial learning rates of 0.1 and 0.04, respectively. After the completion of each set of 10 epochs, the learning rate was decayed by a factor of 0.1. For the AffectNet-7 and AffectNet-8 datasets, model training was conducted for 20 epochs, whereby the learning rate decayed by a factor of 0.1 every 4 epochs, while the batch size was set to 256 and the initial learning rate to 0.04. To achieve class balance, during the training phase for the AffectNet dataset, a dataset sampling strategy was introduced, whereby the low volume categories were upsampled and the high volume categories were downsampled.
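The step-wise learning rate decay described above follows lr(epoch) = lr₀ · 0.1^⌊epoch / step⌋. The sketch below reproduces the reported settings; it assumes zero-based epoch indices, and the function name `step_lr` is hypothetical rather than taken from the training code.

```python
def step_lr(initial_lr, epoch, step_size, gamma=0.1):
    """Learning rate at a zero-based epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# RAF-DB: 50 epochs, initial rate 0.1, decayed every 10 epochs.
raf_db_schedule = [step_lr(0.1, epoch, 10) for epoch in range(50)]

# FERPlus: 50 epochs, initial rate 0.04, decayed every 10 epochs.
ferplus_schedule = [step_lr(0.04, epoch, 10) for epoch in range(50)]

# AffectNet-7 / AffectNet-8: 20 epochs, initial rate 0.04, decayed every 4 epochs.
affectnet_schedule = [step_lr(0.04, epoch, 4) for epoch in range(20)]
```

Under these settings each schedule contains five plateaus, so the final epochs run at 10⁻⁴ times the initial learning rate.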

4.1 Experimental design

RAF-DB [35] is a facial image dataset that contains 29,672 real-world scene images, labeled with seven basic expressions (anger, disgust, fear, joy, neutral, sadness, surprise) and eleven compound expressions. In the experiments, the focus was on identifying the seven basic expressions using a training set of 12,271 images and a test set of 3,068 images. For this purpose, each image was aligned and cropped to the same size (100×100 pixels).

AffectNet [36] is currently the largest in-the-wild facial expression database, containing 450,000 manually annotated facial images collected from the internet through three major search engines. It supports both AffectNet-7 and AffectNet-8 classification; AffectNet-8 adds the “contempt” class to the aforementioned seven emotion categories. This dataset is comprehensive, as it includes not only images of people from different races, but also features background changes, lighting variations, and different postures and types of occlusion, among other factors that can affect FER. In our experiments, AffectNet-7 and AffectNet-8 were evaluated separately. For AffectNet-8, 287,651 class-imbalanced images served as the training set and 4,000 images (500 per class) as the test set. For AffectNet-7, 283,901 class-imbalanced images served as the training set and 3,500 images (500 per class) as the test set.

FERPlus [37] is a dataset of human facial expressions captured in real-world scenarios, obtained by relabeling the FER2013 dataset. It contains 28,709 training images, 3,589 validation images, and 3,589 testing images of 48×48 pixel size, each belonging to one of ten classes of extremely imbalanced expressions. To evaluate the model, in addition to anger, disgust, fear, joy, neutral, sadness, and surprise, the "contempt" class was also considered.

FED-RO is a dataset of 400 real-world facial expression images with variations in facial occlusion that are collected through Bing and Google search engines from which any overlaps with the RAF-DB and AffectNet datasets have been removed by Li et al. [38]. Each image in this dataset contains genuine and unique occlusion. In the experiments, the model was first trained on the AffectNet and RAF-DB training datasets and was subsequently tested by applying it to FED-RO.

4.2 Evaluation indicators

The proposed model is subjected to ablation experiments and comparisons with state-of-the-art methods on the publicly available datasets RAF-DB, FERPlus, and AffectNet. The evaluation indicator used is accuracy (%), which is commonly used to measure the performance of a classification model and represents the ratio of the number of samples correctly predicted by the model to the total number of samples. It can be expressed as follows:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (14)

where TP denotes the number of positive samples correctly classified as positive, TN denotes the number of negative samples correctly classified as negative, FP denotes the number of negative samples incorrectly classified as positive, and FN denotes the number of positive samples incorrectly classified as negative.
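As a minimal illustration of Eq. (14), the following sketch computes accuracy both from the four counts and directly from label lists; in the multi-class FER setting the ratio reduces to the number of correctly predicted samples divided by the total. The function names and the toy labels are illustrative assumptions and are not taken from the experiments.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Eq. (14): correctly classified samples over all samples, in percent."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return 100.0 * (tp + tn) / total


def accuracy(y_true, y_pred):
    """Multi-class form: share of samples whose prediction matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Hypothetical toy labels (not drawn from any dataset used in this paper).
y_true = ["joy", "anger", "fear", "joy", "neutral"]
y_pred = ["joy", "anger", "sadness", "joy", "neutral"]
acc = accuracy(y_true, y_pred)                            # 4 of 5 correct
acc_binary = accuracy_from_counts(tp=3, tn=1, fp=1, fn=0)  # (3 + 1) / 5
```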

4.3 Ablation studies

An ablation analysis was conducted on the public datasets RAF-DB, AffectNet-8, and FERPlus to demonstrate the effectiveness of the proposed method when applied to real-world scenarios. In the experiments, the performance of the multi-feature convolution module and the cross-attention module was evaluated, as well as the effect of adding SENet to the multi-feature convolution module and of varying the M value of the cross-attention module.

To verify the effectiveness of each module, a pre-trained backbone network was used in the ablation analysis. As shown in Table 1, relative to the baseline, a single multi-feature convolution module increases the recognition rate on RAF-DB, AffectNet-8, and FERPlus by 0.02%, 0.08%, and 0.03%, respectively. When the symmetric multi-feature convolution module is applied to RAF-DB, AffectNet-8, and FERPlus, the recognition rate increases by 0.10%, 0.48%, and 0.07%, respectively, indicating that it can extract rich facial expression features for each sub-feature. Furthermore, when the SENet attention mechanism is added to the symmetric multi-feature convolution module, the recognition rates on RAF-DB, AffectNet-8, and FERPlus are improved by 0.36%, 0.56%, and 0.35%, respectively. These results demonstrate that the module can obtain robust FER features for the overall sub-feature as well as alleviate the problem of missing expression information features. Moreover, the cross-attention module alone improves the recognition rate by 0.36%, 0.50%, and 0.61% on RAF-DB, AffectNet-8, and FERPlus, respectively, indicating that this module can obtain features from multiple regions to improve FER. Finally, when all these modules are combined, the recognition rates on RAF-DB, AffectNet-8, and FERPlus are improved by 1.37%, 0.96%, and 1.53%, respectively.

Table 1
Analysis of each module when applied to RAF-DB, AffectNet-8, and FERPlus (%)

Methods	RAF-DB	AffectNet-8	FERPlus
Baseline	88.69	60.22	87.66
Baseline+MF	88.71	60.30	87.69
Baseline+SMF	88.79	60.70	87.73
Baseline+SMF+SE	89.05	60.78	88.01
Baseline+CA	89.05	60.72	88.27
Baseline+SMF+SE+CA	90.06	61.18	89.19
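The per-module gains discussed for Table 1 are differences from the baseline row. The short sketch below recomputes them from the tabulated accuracies; the dictionary layout and the function name `gains_over_baseline` are illustrative assumptions, while the numbers are copied from Table 1.

```python
# Accuracies (%) copied from Table 1, in the order RAF-DB, AffectNet-8, FERPlus.
table1 = {
    "Baseline": (88.69, 60.22, 87.66),
    "Baseline+MF": (88.71, 60.30, 87.69),
    "Baseline+SMF": (88.79, 60.70, 87.73),
    "Baseline+SMF+SE": (89.05, 60.78, 88.01),
    "Baseline+CA": (89.05, 60.72, 88.27),
    "Baseline+SMF+SE+CA": (90.06, 61.18, 89.19),
}


def gains_over_baseline(table, baseline="Baseline"):
    """Return, per configuration, the accuracy difference from the baseline row."""
    base = table[baseline]
    return {
        name: tuple(round(acc - ref, 2) for acc, ref in zip(accs, base))
        for name, accs in table.items()
        if name != baseline
    }


gains = gains_over_baseline(table1)
for name, deltas in gains.items():
    print(name, deltas)
```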

Therefore, the decomposition of deep features and the incorporation of SENet in the proposed symmetric multi-feature convolution module can reduce the sensitivity of deep convolution to occlusion, obtain rich and robust salient expression recognition features for each sub-feature, and alleviate the problem of missing facial expression information features. Additionally, the cross-attention module can simultaneously focus on multiple salient facial regions that have not been occluded, as well as mitigate interference from occlusion and enhance the model’s recognition ability. In sum, these enhancements account for the superior performance of our method in real-world scenarios.

Given that the cross-attention module size (M) affects the model performance, its impact was assessed using the RAF-DB dataset and the findings are shown in Figure 7. As the cross-attention model is superior to the single attention model, in further experiments, two cross-attention modules were employed to enhance the recognition performance of the model.

Fig. 7

The results of ablation analysis of the local cross-attention module applied to RAF-DB.

Two traditional fusion strategies are compared: feature-level fusion and decision-level fusion. Feature-level fusion directly combines the feature vectors obtained from the two branches into a single feature vector for comprehensive analysis and processing, and trains the FER classifier on it to obtain the final recognition result. In contrast, decision-level fusion fuses the recognition results of the two branches, and the final decision result is the global optimal decision. Table 2 indicates that decision-level fusion is clearly superior to feature-level fusion.

Table 2

The two fusion strategies are compared on RAF-DB

Methods	Accuracy(%)
Feature-level fusion	88.53
Decision-level fusion	90.06
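The difference between the two strategies compared in Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the toy features, the hand-set classifier weights, and the use of probability averaging as the decision-level rule are all hypothetical choices made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one weight row and one bias entry per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def feature_level_fusion(feat_a, feat_b, weights, bias):
    """Concatenate both branch features, then apply a single classifier."""
    fused = feat_a + feat_b  # list concatenation
    return softmax(linear(weights, bias, fused))


def decision_level_fusion(probs_a, probs_b):
    """Average the class probabilities predicted by the two branches (assumed rule)."""
    return [(pa + pb) / 2.0 for pa, pb in zip(probs_a, probs_b)]


def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical 2-D branch features and 3-class classifiers.
feat_a, feat_b = [1.0, 0.0], [0.0, 2.0]
zero_bias = [0.0, 0.0, 0.0]
w_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_b = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_fused = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]

p_feature = feature_level_fusion(feat_a, feat_b, w_fused, zero_bias)
p_decision = decision_level_fusion(
    softmax(linear(w_a, zero_bias, feat_a)),
    softmax(linear(w_b, zero_bias, feat_b)),
)
```

In feature-level fusion one classifier sees both branches at once, whereas in decision-level fusion each branch keeps its own classifier and only the outputs are merged.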

4.4 The MFCA-Net effectiveness

In this section, the proposed method is compared with state-of-the-art approaches by applying each to the RAF-DB, AffectNet-7, AffectNet-8, and FERPlus datasets.

1) Results on RAF-DB: The results shown in Table 3 indicate that the MFCA-Net method outperforms many state-of-the-art alternatives, achieving 90.06% accuracy. Among the attention-based methods RAN [19], LANet [40], LAViT [41], and DAN [29], DAN performs best, and the MFCA-Net exceeds it by 0.36%. Among the loss-based methods DACL [22] and EAFR loss [42], EAFR loss performs best, and the MFCA-Net exceeds it by 0.26%. Likewise, our network outperforms MA-Net [20] and MFNet [26] by 1.66% and 1.53%, respectively. A confusion matrix is used to further illustrate the results. As shown in Figure 8(a), the Disgust expression is easily confused with the Sadness expression, possibly because these two expressions have similar facial features, whereas the proposed MFCA-Net model achieves high accuracy on the other expression categories.

Fig. 8

Confusion matrix of RAF-DB (a), FERPlus (b) and AffectNet-8(c).
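Confusion matrices such as those in Fig. 8 count, for each true class, how often each class is predicted; normalizing each row gives the per-class recognition rate on the diagonal. The sketch below shows one way to build such a matrix. The function names and the toy labels are illustrative assumptions and do not reproduce the values in Fig. 8.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows index the true class, columns index the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def row_normalize(matrix):
    """Divide each row by its sum so that the diagonal holds per-class recall."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical toy predictions in which Disgust is partly mistaken for Sadness.
classes = ["disgust", "sadness", "joy"]
y_true = ["disgust", "disgust", "sadness", "joy", "joy"]
y_pred = ["disgust", "sadness", "sadness", "joy", "joy"]

cm = confusion_matrix(y_true, y_pred, classes)
cm_norm = row_normalize(cm)
```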

Table 3

The results obtained by applying state-of-the-art approaches to the RAF-DB dataset

Methods	Years	Accuracy(%)
RAN[19]	2020	86.90
MA-Net[20]	2020	88.40
DACL[22]	2021	87.78
LANet[40]	2021	86.70
DAN[29]	2021	89.70
LAViT[41]	2022	87.48
MFNet[26]	2022	88.53
EAFR loss[42]	2022	89.80
MFCA-Net(Ours)	2023	90.06

2) Results on FERPlus: The results reported in Table 4 indicate that the MFCA-Net method achieves an accuracy of 89.19%, which is a 0.64% improvement relative to the attention-based method RAN [19] and a 1.59% improvement relative to the loss-based RW loss [43]. Moreover, the MFCA-Net method outperforms the other feature extraction network models. A confusion matrix is used to further illustrate the results. As shown in Figure 8(b), the Disgust, Anger, and Contempt expressions are easily confused with other expressions, but the proposed MFCA-Net model performs well on the other expression categories.

Table 4

Comparison of the proposed method with state-of-the-art alternatives when applied to the FERPlus dataset

Methods	Years	Accuracy(%)
RAN[19]	2020	88.55
ESRs[44]	2020	87.25
RW loss[43]	2020	87.60
VTFF[45]	2021	88.81
ADC-Net[46]	2021	88.90
SpResNet-ViT[47]	2022	88.10
MFCA-Net(Ours)	2023	89.19

3) Results on AffectNet: Although the AffectNet dataset has been manually annotated with eleven categories, the AffectNet-7 and AffectNet-8 settings were used to evaluate the effectiveness of the MFCA-Net method. Given that the AffectNet dataset contains an imbalanced training set, a data sampling strategy was adopted during the training process. The aim was to rebalance the interclass distribution by automatically estimating the weight of each sample when sampling from the imbalanced dataset, thereby undersampling the majority class samples and oversampling the minority class samples. As can be seen from the results reported in Table 5, our method achieves an accuracy of 65.74% and 61.12% when applied to AffectNet-7 and AffectNet-8, respectively, thus outperforming most existing methods. Specifically, when compared to the attention-based methods DAN [29] and LAViT [41] on AffectNet-7, the MFCA-Net method is slightly superior to DAN, the stronger of the two. When tested on AffectNet-8, the MFCA-Net method is also superior to the other methods in Table 5, with the exception of the attention-based method DAN [29]. The confusion matrix is used to further illustrate the results on the AffectNet-8 dataset, as shown in Figure 8(c). The dataset itself is difficult to recognize because of various kinds of noise, but the MFCA-Net model proposed in this paper shows good performance on various types of expressions.
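The sampling strategy described above can be sketched with inverse-frequency sample weights: each sample receives a weight of one over the size of its class, so that drawing with replacement visits every class about equally often. This is a hedged illustration only; the function names, the toy label list, and the inverse-frequency weighting are assumptions, since the exact weighting used in the experiments is not specified here.

```python
import random
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by the inverse of its class frequency (assumed scheme)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_balanced_indices(labels, num_samples, seed=0):
    """Draw indices with replacement so that classes are roughly equally represented."""
    rng = random.Random(seed)
    weights = balanced_sample_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Hypothetical imbalanced label list: 90 majority and 10 minority samples.
labels = ["joy"] * 90 + ["contempt"] * 10
weights = balanced_sample_weights(labels)
indices = sample_balanced_indices(labels, num_samples=1000)
drawn = Counter(labels[i] for i in indices)
```

Because the weights of each class sum to the same value, majority-class images are drawn less often than their raw share (undersampling) and minority-class images more often (oversampling).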

Table 5

Comparison of the AffectNet-7 and AffectNet-8 results with those obtained by the state-of-the-art methods

Methods	Years	Classes	Accuracy(%)
SNA-DFER[52]	2020	7	62.70
MA-Net[20]	2020	7	64.53
EfficientFace[24]	2021	7	63.70
MViT[48]	2021	7	64.57
DAN[29]	2021	7	65.69
LAViT[41]	2022	7	62.82
MAPNet[23]	2022	7	64.09
MFCA-Net(Ours)	2023	7	65.74
RAN[19]	2020	8	59.50
MA-Net[20]	2020	8	60.21
DAN[29]	2021	8	62.09
EfficientFace[24]	2021	8	59.89
MFNet[26]	2022	8	60.38
EAFR loss[42]	2022	8	61.05
MFCA-Net(Ours)	2023	8	61.12

4) MFCA-Net features: Figure 9 illustrates the feature extraction capability of MFCA-Net, where the highlighted areas indicate the regions from which MFCA-Net extracts features. On images without occlusion, MFCA-Net can select more of the useful features. On locally occluded images, useful features can be extracted from the unoccluded regions. This visual analysis shows that MFCA-Net has good feature extraction ability on images both with and without occlusion.

Fig. 9

Feature extraction capability of MFCA-Net.

4.5 Evaluation of the effect of realistic occlusion on model performance

In order to evaluate the performance of the MFCA-Net model in real-world scenarios, it was applied to several datasets featuring realistic occlusion, including Occlusion-RAF-DB, Occlusion-FERPlus, Occlusion-AffectNet, and FED-RO. For the Occlusion-RAF-DB, Occlusion-FERPlus, and Occlusion-AffectNet datasets, the same experimental settings as reported in the pertinent literature were adopted. For the FED-RO dataset, all training images from the RAF-DB and AffectNet datasets were merged for fair comparison, and testing was conducted on FED-RO. Several examples of images from the real occlusion datasets are shown in Figure 10.

Fig. 10

Realistic occlusion sample images sourced from the RAF-DB (the first row), AffectNet (the second row), FERPlus (the third row), and FED-RO (the fourth row) databases.

As can be seen from Table 6, which compares the MFCA-Net method with state-of-the-art approaches, MFCA-Net achieves the highest accuracy among the compared methods on the Occlusion-RAF-DB, Occlusion-FERPlus, Occlusion-AffectNet, and FED-RO datasets, with accuracy rates of 87.74%, 86.28%, 62.66%, and 71.25%, respectively. Thus, MFCA-Net demonstrates strong generalization ability as well as the potential for practical applications in real-world scenarios. It also effectively addresses the significant challenge of local occlusion in FER.

Table 6

Comparison of the proposed method with state-of-the-art alternatives when applied to the Occlusion-RAF-DB, Occlusion-FERPlus, Occlusion-AffectNet, and FED-RO datasets with real occlusion

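Datasets	Methods	Accuracy(%)
Occlusion-RAF-DB	ResNet18[19]	80.19
Occlusion-RAF-DB	RAN[19]	82.72
Occlusion-RAF-DB	EfficientFace[24]	83.24
Occlusion-RAF-DB	MA-Net[20]	83.65
Occlusion-RAF-DB	VTFF[45]	83.95
Occlusion-RAF-DB	MPCSAN[49]	86.26
Occlusion-RAF-DB	MFCA-Net(Ours)	87.74
Occlusion-FERPlus	ResNet18[19]	73.33
Occlusion-FERPlus	RAN[19]	83.63
Occlusion-FERPlus	VTFF[45]	84.79
Occlusion-FERPlus	MPCSAN[49]	86.12
Occlusion-FERPlus	MFCA-Net(Ours)	86.28
Occlusion-AffectNet	ResNet18[19]	49.48
Occlusion-AffectNet	RAN[19]	58.50
Occlusion-AffectNet	EfficientFace[24]	59.88
Occlusion-AffectNet	MA-Net[20]	59.59
Occlusion-AffectNet	MFCA-Net(Ours)	62.66
FED-RO	gACNN[38]	66.50
FED-RO	RAN[19]	67.98
FED-RO	OADN[50]	71.17
FED-RO	LAENet-SA[51]	68.25
FED-RO	MA-Net[20]	70.00
FED-RO	MFCA-Net(Ours)	71.25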

5 Conclusion

In this work, a multi-feature cross-attention network (MFCA-Net) was presented and applied to solve FER under local occlusion conditions. The findings yielded by the extensive analyses and experiments confirm that the proposed method can obtain robust multi-scale and local features. Specifically, MFCA-Net consists of multi-feature convolution and local cross-attention modules. The multi-feature convolution module decomposes deep features into multiple sub-features, extracts multi-scale features, and reduces the loss of expression information caused by deep convolution networks being affected by local occlusion. The local cross-attention module can simultaneously focus on multiple facial regions that have not been occluded, extract features from these regions, and alleviate the interference caused by local occlusion.

The experiments conducted on the public datasets RAF-DB, AffectNet, and FERPlus demonstrate the robustness and effectiveness of MFCA-Net. The feature extraction ability of MFCA-Net is demonstrated by dataset visualization. Compared to other state-of-the-art methods, MFCA-Net achieves optimal performance and performs equally well on datasets with real occlusion.

However, the proposed method has limitations under certain conditions. As shown in Figure 11, blurred or low-quality images and excessively large occluded areas can lead to erroneous labeling of expressions. In future work, we will try to address these noise problems.

Fig. 11

Examples of misclassified images. The first row shows the true labels and the second row shows the incorrect labels.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (62241206), the Science and Technology Projects of Guizhou Province (QKHJCZK2022YB195, QKHJCZK2023YB143, QKHPTRCZCKJ2021007, QKHJCZK2022YB550, ZCKJ2021007), the Youth Science and Technology Talent Growth Project of Guizhou Province (QJHKY2021104), the Natural Science Research Project of Education Department of Guizhou Province (QJJ2023061, QJJ2023012), and the Education Evaluation Reform Pilot Project of Guizhou Province “Quality Evaluation of Teaching Process”.

References

1. Duric, Gray W.D., Heishman, Li, Rosenfeld, Schoelles M.J., Schunn and Wechsler, Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction, Proceedings of the IEEE 90(7) (2002), 1272–1289.

2. Jeong and Ko B.C., Driver’s facial expression recognition in real-time for safe driving, Sensors 18(12) (2018), 4270.

3. Jin H.L., Du, Wen, Zhao, Shi and Zhang, A classroom facial expression recognition method based on attention mechanism, Journal of Intelligent & Fuzzy Systems Preprint (2023), 1–10.

4. Shan, Gong and McOwan P.W., Facial expression recognition based on local binary patterns: A comprehensive study, Image and Vision Computing 27(6) (2009), 803–816.

5. Zhao and Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6) (2007), 915–928.

6. Zhong, Liu, Yang, Liu, Huang and Metaxas D.N., Learning active facial patches for expression analysis, 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), 2562–2569.

7. Girshick, Donahue, Darrell and Jitendra, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), 580–587.

8. Wang, Ding and Wang, Occluded Facial Expression Recognition using Self-supervised Learning, Proceedings of the Asian Conference on Computer Vision (2022), 1077–1092.

9. Liu, Hirota and Dai, Patch attention convolutional vision transformer for facial expression recognition with occlusion, Information Sciences 619 (2023), 781–794.

10. Fang, Lin, Wu, An and Sun, Optimization of Facial Expression Recognition Based on Dual Attention Mechanism by Lightweight Network Model, Journal of Intelligent & Fuzzy Systems 45(5) (2023), 9069–9081.

11. Zou B.J., Guo Y.D., He, Ouyang P.B., Liu and Chen Z.L., 3D filtering by block matching and convolutional neural network for image denoising, Journal of Computer Science and Technology 33 (2018), 838–848.

12. Abbaszadeh Shahri and Maghsoudi Moud, Landslide susceptibility mapping using hybridized block modular intelligence model, Bulletin of Engineering Geology and the Environment 80 (2021), 267–284.

13. Fang, Lin, Liu, An and Sun, Triple Attention Feature Enhanced Pyramid Network for Facial Expression Recognition, Journal of Intelligent & Fuzzy Systems 44(5) (2023), 8649–8661.

14. Prasad A.R. and Rajesh, Hybrid Heuristic Mechanism for Occlusion Aware Facial Expression Recognition Scheme Using Patch Based Adaptive CNN with Attention Mechanism, Journal of Intelligent & Fuzzy Systems 17(3) (2023), 773–797.

15. Park S.J., Kim B.G. and Chilamkurti, A robust facial expression recognition algorithm based on multi-rate feature fusion scheme, Sensors 21(21) (2021), 6954.

16. Liu Q.M. and Xin Y.Y., End-to-end Low quality facial image Expression recognition, Microcomputer System 41(3) (2020), 668–672.

17. Poux, Allaert, Ihaddadene, Bilasco I.M., Djeraba and Bennamoun, Dynamic facial expression recognition under partial occlusion with optical flow reconstruction, IEEE Transactions on Image Processing 31 (2021), 446–457.

18. […], Zeng, Shan and Chen, Patch-gated CNN for occlusion-aware facial expression recognition, 2018 24th International Conference on Pattern Recognition (ICPR) (2018), 2209–2214.

19. Wang, Peng, Yang, Meng and Qiao, Region attention networks for pose and occlusion robust facial expression recognition, IEEE Transactions on Image Processing 29 (2020), 4057–4069.

20. Zhao, Liu and Wang, Learning deep global multi-scale and local attention features for facial expression recognition in the wild, IEEE Transactions on Image Processing 30 (2021), 6544–6556.

21. Woo, Park, Lee J.Y. and Kweon I.S., Cbam: Convolutional block attention module, Proceedings of the European Conference on Computer Vision (ECCV) (2018), 3–19.

22. Farzaneh A.H. and Qi, Facial expression recognition in the wild via deep attentive center loss, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021), 2402–2411.

23. […] and Zhao, Mask-based attention parallel network for in-the-wild facial expression recognition, ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), 2410–2414.

24. Zhao, Liu and Zhou, Robust lightweight facial expression recognition network with label distribution training, Proceedings of the AAAI Conference on Artificial Intelligence 35(4) (2021), 3510–3519.

25. Zhong, Bai, Li, Chen, Li and Liu, A graph-structured representation with brnn for static-based facial expression recognition, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) (2019), 1–5.

26. Gong, Wang, Jia, Qian and Fan, Multi-feature fusion network for facial expression recognition in the wild, Journal of Intelligent & Fuzzy Systems 42(6) (2022), 4999–5011.

27. Ruan, Han, Sun, Chen and Li, Facial expression recognition in facial occlusion scenarios: A path selection multinetwork, Displays 74 (2022), 102245.

28. […], Shen and Sun, Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 7132–7141.

29. Wen, Lin, Wang and Xu, Distract your attention: multi-head cross attention network for facial expression recognition, Biomimetics 8(2) (2023), 199.

30. […], Zhang, Ren and Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 770–778.

31. Asheghi, Hosseini S.A., Saneie and Shahri A.A., Updating the neural network sediment load models using different sensitivity analysis methods: a regional application, Journal of Hydroinformatics 22(3) (2020), 562–577.

32. Dupuis, Novo, O’Connor and Bosio, Sensitivity analysis and compression opportunities in dnns using weight sharing, 2020 23rd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS) (2020), 1–6.

33. Zhou, Khosla, Lapedriza, Oliva and Torralba, Learning deep features for discriminative localization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 2921–2929.

34. Abbaszadeh Shahri, Shan and Larsson, A novel approach to uncertainty quantification in groundwater table modeling by automated predictive deep learning, Natural Resources Research 31(3) (2022), 1351–1373.

35. […], Deng and Du J.P., Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), 2852–2861.

36. Mollahosseini, Hasani and Mahoor M.H., Affectnet: A database for facial expression, valence, and arousal computing in the wild, IEEE Transactions on Affective Computing 10(1) (2017), 18–31.

37. Barsoum, Zhang, Ferrer C.C. and Zhang, Training deep networks for facial expression recognition with crowdsourced label distribution, Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016), 279–283.

38. Li, Zeng, Shan and Chen, Occlusion aware facial expression recognition using CNN with attention mechanism, IEEE Transactions on Image Processing 28(5) (2018), 2439–2450.

39. Guo, Zhang, Hu, He and Gao, Ms-celeb-1m: A dataset and benchmark for large-scale face recognition, Computer Vision–ECCV: 14th European Conference, Amsterdam, The Netherlands, October 11-14, Proceedings, Part III 14. Springer International Publishing (2016), 87–102.

40. […], Celik and Li H.C., Lightweight attention convolutional neural network through network slimming for robust facial expression recognition, Signal, Image and Video Processing 15 (2021), 1507–1515.

41. Zhao, Liu and Liu, Facial Expression Recognition Based on Visual Transformers and Local Attention Features Network, 2022 7th International Conference on Computer and Communication Systems (ICCCS) (2022), 228–231.

42. Gong, Fan and Qian, Effective attention feature reconstruction loss for facial expression recognition in the wild, Neural Computing and Applications 34(12) (2022), 10175–10187.

43. […] T.H., Lee G.S., Yang H.J. and Kim S.H., Pyramid with super resolution for in-the-wild facial expression recognition, IEEE Access 8 (2020), 131988–132001.

44. Siqueira, Magg and Wermter, Efficient facial feature learning with wide ensemble-based convolutional neural networks, Proceedings of the AAAI Conference on Artificial Intelligence (2020), 5800–5809.

45. […], Sun and Li, Facial expression recognition with visual transformers and attentional selective fusion, IEEE Transactions on Affective Computing (2021).

46. […] H.Y., Li, Tan, Li and Song, Destruction and reconstruction learning for facial expression recognition, IEEE MultiMedia 28(2) (2021), 20–28.

47. Gao, Li and Zhao, Facial Expression Recognition Method Based on SpResNet-ViT, 2022 2nd Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS) (2022), 182–187.

48. […], Sui, Zhao, Zha and Wu, MVT: mask vision transformer for facial expression recognition in the wild, arXiv preprint arXiv:2106.04520 (2021).

49. Gong, Qian and Fan, MPCSAN: multi-head parallel channel-spatial attention network for facial expression recognition in the wild, Neural Computing and Applications 35(9) (2023), 6529–6543.

50. Ding, Zhou and Chellappa, Occlusion-adaptive deep network for robust facial expression recognition, 2020 IEEE International Joint Conference on Biometrics (IJCB) (2020), 1–9.

51. Wang, Xue, Lu and Yan, Light attention embedding for facial expression recognition, IEEE Transactions on Circuits and Systems for Video Technology 32(4) (2021), 1834–1847.

52. […], Wu, Li, Pan and Luo, Semantic neighborhood-aware deep facial expression recognition, IEEE Transactions on Image Processing 29 (2020), 6535–6548.