Abstract
To address the limitations of current attention mechanisms in extracting key facial expression features, and the low accuracy of facial expression recognition caused by convolutional neural networks giving insufficient consideration to feature fusion across receptive fields, we propose a facial expression recognition network based on wide attention (WA) and a multi-scale fusion (MF) mechanism, termed WAMF (wide attention and multi-scale fusion). WA extracts the background information of facial expression images while also focusing on texture information, thus achieving better feature extraction. The MF mechanism is added at the connection points of layers in ResNet, where features extracted from each upper layer are fused using different-sized convolutional kernels and input into the lower layer. Finally, a viewpoint-invariant Capsule Net receives the feature maps and serves as the classification network. The proposed WAMF model was evaluated on two publicly available datasets, CK+ and Jaffe, achieving recognition rates of 98.98% and 98.46%, respectively.
Introduction
Human emotions refer to the genuine feelings experienced by an individual after being influenced by external factors, which can be expressed through various means such as facial expressions, gestures, voice, and text. In particular, facial expressions convey up to 55% of emotional information (Pantic and Rothkrantz, 2000), making facial expression analysis an active research area in the field of emotion recognition. Facial expressions are also the most natural and intuitive way to judge emotions. Facial expression recognition is a branch of computer vision that is widely used in real-world scenarios, such as fatigued-driving detection (Jeong and Ko, 2018) and assisted medical treatment (Li et al., 2019). The process of facial expression recognition includes three stages: facial expression image pre-processing, facial feature extraction, and expression classification. Currently, expression recognition methods fall into two groups, traditional methods and deep learning-based methods, distinguished by whether or not deep learning is used.
Traditional methods for extracting facial expression features include hidden Markov models (Bystroff et al., 2000), Bayesian classification (Vinay et al., 2018), and support vector machine feature classification (Suykens and Vandewalle, 1999). However, these methods suffer from difficult feature extraction and low recognition accuracy, so researchers have come to prefer deep learning methods. After Hinton and Salakhutdinov (2006) proposed a method to solve the vanishing gradient problem, deep learning methods became widely used. Deep learning techniques make it easier to extract features of facial expressions, and the accuracy of recognition has been greatly improved. Liu et al. (2021) proposed an expression recognition structure that combines long short-term memory (LSTM) networks and capsule networks, which consists of three parts: capsule encoder, capsule decoder, and LSTM module. The LSTM module can effectively extract differences and features between adjacent video frames to improve the accuracy of video representation recognition. The combination of the capsule network and LSTM achieved good results in facial expression recognition in videos. Cao et al. (2020) proposed a double-enhanced capsule neural network, E2-Capsnet, which includes two feature enhancement modules. The first module uses action unit attention to focus on the important parts, and multiple convolutional layers are added at the capsule site to form the second module, which enhances feature representation, resulting in better recognition.
Recent advancements in the field, such as the work by Xiao et al. on dense knowledge-aware networks for multivariate time series classification (Xiao et al., 2024a), and deep transformer capsule mutual distillation for multivariate time series classification (Xiao et al., 2024c), highlight the importance of advanced feature extraction techniques. Although primarily focused on time series data, the principles of these methods can be adapted to improve facial expression recognition systems by enhancing feature learning capabilities. Furthermore, Xiao et al. (2024b) proposed a deep contrastive representation learning method with self-distillation, which integrates data augmentation, deep contrastive learning, and self-distillation to address deficiencies in existing contrastive learning algorithms. This approach can significantly improve the performance of time series classification and clustering, and its principles can be applied to enhance facial expression recognition.
While deep learning methods show excellent performance in facial expression recognition, facial expression images often contain interference information from non-expression regions, which can greatly affect recognition accuracy. In recent years, attention mechanisms have become a hot topic in deep learning research, and many excellent facial expression recognition networks have used them. Attention mechanisms simulate the way humans learn about new things, mainly by focusing neural networks on key target features and increasing the weight of these features to enhance recognition. Hu et al. (2018) proposed a channel attention mechanism in 2017, the Squeeze-and-Excitation Network (SENet), which considers the channel dimension of the feature map and enhances the weights of important channels, making the network attend to different channel feature maps with different attention weights. The convolutional block attention module (CBAM), proposed by Woo et al. (2018) in 2018, overcomes the deficiencies of the SENet approach by extracting features along both the channel and spatial dimensions. Hou et al. (2021) proposed a relatively novel coordinate attention mechanism in 2021 as a modification of CBAM, which incorporates the coordinate location information of features into the channel attention during feature extraction, thereby capturing cross-channel, direction-aware, and position-sensitive information.
Additionally, the recent work by Tarek et al. (2023) on eye detection-based deep belief neural networks and the CapMatch method by Xiao et al. (2023) emphasize the significance of robust feature extraction techniques under varying conditions, which can also be relevant to improving the robustness of facial expression recognition systems. Furthermore, the approach by ElAraby and Shams (2021) on face retrieval systems and the work by Sarhan et al. (2020) on multipose face recognition highlight the need for adaptive and efficient algorithms in face-related recognition tasks, which can provide insights into improving facial expression recognition accuracy and robustness.
In order to more effectively capture facial expression features and fuse extracted features at a deeper level to enhance the representation of relevant features, this paper proposes a novel WAMF (wide attention and multi-scale fusion) network based on the WA (wide attention) and MF (multi-scale fusion) mechanisms for facial expression recognition.
The contributions of this study include the following:
We propose a WAMF network for facial expression recognition tasks, which innovatively integrates feature extraction and fusion steps, enhancing the capture of feature information.
The WA mechanism can attentively focus on various aspects of the image, including its horizontal and vertical dimensions, texture, and background. This in-depth and comprehensive attention mechanism effectively mines the information embedded in the image, demonstrating strong performance in the image feature extraction step.
We have devised an MF mechanism for ResNet, which integrates information from shallow and deep layers at various scales, facilitating feature interaction.
Comparative and ablation experiments were conducted on publicly available facial expression datasets CK+ and Jaffe, demonstrating the effectiveness and robustness of WAMF.
Related work
Since convolutional neural networks (CNNs) became widely popular in deep learning, they have been used frequently in the field of image processing. ResNet, a well-known CNN structure proposed by He et al. (2016) in 2015, stands out among various CNNs owing to its multi-level superposition advantage. Qian et al. (2022) designed an expression recognition network with a ResNet34 backbone that performs feature extraction at the head and tail of the network, and Jiang and Huang (2022) used ResNet50 for feature extraction at the channel level of the image. While ResNet mitigates the vanishing and exploding gradient problems, it does not address the interaction of features between shallow and deep layers. Therefore, this paper introduces an MF mechanism, which considers feature fusion in ResNet from a multi-scale perspective.
Attention mechanisms, as a hot topic in deep learning research, are closely intertwined with image processing tasks. In the context of facial expression tasks, various networks tend to segment facial image features in a somewhat similar manner, yet the effectiveness of different attention mechanisms varies significantly. Various attention mechanisms are widely used in facial expression recognition tasks. Qian et al. (2022) proposed a strong attention mechanism in 2021 by adding it before and after residual modules to strengthen the extraction of key information and achieved excellent results. Zhang et al. (2022) proposed an expression recognition network based on channel attention mechanisms, mainly focusing on enhancing key feature information representation through channel attention mechanisms. Jiang and Huang (2022) introduced a ResNet50 model with multiple feature layers that have a channel attention module for facial expression recognition, utilizing channel attention mechanisms to enhance key feature channels and improve recognition accuracy. However, most attention mechanisms are not sufficiently effective for feature extraction in facial expression recognition tasks, and some CNNs lack feature fusion at the receptive field level, leading to poor facial expression recognition accuracy. Hence, we have designed an attention module to enhance the feature extraction capability of the network.
Methodology
Overview
We propose a network called WAMF to address the issues of insufficient feature extraction and information fusion in facial expression recognition tasks. As shown in Figure 1, WAMF consists of three components: image preprocessing, an enhanced ResNet module, and a capsule network module. Firstly, we crop facial expression images to remove irrelevant information. Next, these cropped images are fed into an enhanced ResNet-18 network. In this improved ResNet-18 network, we introduce WA mechanisms in each layer to enhance the capture of feature information, and we incorporate MF mechanisms between every two layers to facilitate information fusion. Finally, the extracted feature maps are input into a capsule network with viewpoint invariance for classification, yielding the ultimate recognition results.

WAMF network structure.

WA structure diagram.
Maximum pooling mitigates feature extraction errors resulting from shifts in the estimated mean value due to convolutional layer parameter errors, thereby preserving more texture information. On the other hand, mean pooling reduces feature extraction errors associated with an increase in estimated variance due to neighborhood size limitations, allowing for the retention of more background image information.
The formula of max-pooling is given in equation (1)
The formula of mean-pooling is given in equation (2)
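A common way to write the two pooling operations is sketched here. The notation is an assumption made for illustration: $R_{ij}$ denotes the pooling window associated with output position $(i, j)$, $|R_{ij}|$ is the number of elements in that window, and $x_{pq}$ is the input activation at position $(p, q)$.

```latex
\begin{align}
y^{\max}_{ij} &= \max_{(p,q) \in R_{ij}} x_{pq} \\
y^{\mathrm{mean}}_{ij} &= \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} x_{pq}
\end{align}
```

The maximum keeps only the strongest response in the window, which is why it tends to preserve texture, whereas the mean averages over every response, which is why it tends to retain background information.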
WA is a novel attention mechanism introduced in this paper that effectively captures both background and texture information from images. Figure 2 illustrates the internal structure of WA. Specifically, the process begins by applying average pooling and max pooling separately along the spatial axes (horizontal and vertical) of the input feature maps. Next, the results of average pooling along the horizontal and vertical directions are recombined, as are the results of max pooling. These recombined results are then separated along both horizontal and vertical axes. Subsequently, the separated results along the horizontal axis are recombined, as are the results along the vertical axis. The final step involves combining these two recombined results to calculate the weight for each pixel in the feature map. These weights are then used to perform element-wise multiplication with the input feature map, resulting in the output feature map.
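The directional pooling idea can be illustrated with a deliberately simplified sketch. It is not the WA module itself: it works on a single-channel map stored as nested lists, it omits every learned convolution, and it merges the average and max branches by plain summation before a sigmoid, all of which are assumptions made for illustration. The function names `directional_pool` and `wide_attention_sketch` are hypothetical.

```python
import math


def sigmoid(value):
    """Logistic function used to squash pooled statistics into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def directional_pool(fmap):
    """Average- and max-pool a 2D map along each spatial axis.

    Returns per-row and per-column statistics: row_* has one value per
    row (pooled across the width), col_* one value per column (pooled
    across the height).
    """
    height, width = len(fmap), len(fmap[0])
    columns = [[fmap[i][j] for i in range(height)] for j in range(width)]
    row_avg = [sum(row) / width for row in fmap]
    row_max = [max(row) for row in fmap]
    col_avg = [sum(col) / height for col in columns]
    col_max = [max(col) for col in columns]
    return row_avg, row_max, col_avg, col_max


def wide_attention_sketch(fmap):
    """Weight every pixel by a gate built from directional statistics.

    Assumption: average (background) and max (texture) cues are merged by
    summation; the real module would use learned layers at this point.
    """
    row_avg, row_max, col_avg, col_max = directional_pool(fmap)
    row_gate = [sigmoid(a + m) for a, m in zip(row_avg, row_max)]
    col_gate = [sigmoid(a + m) for a, m in zip(col_avg, col_max)]
    # One weight per pixel: product of its row gate and its column gate.
    weights = [[rg * cg for cg in col_gate] for rg in row_gate]
    # Element-wise multiplication with the input feature map.
    output = [
        [value * weight for value, weight in zip(f_row, w_row)]
        for f_row, w_row in zip(fmap, weights)
    ]
    return weights, output


if __name__ == "__main__":
    toy_map = [[0.0, 1.0], [2.0, 3.0]]  # hypothetical 2 x 2 feature map
    toy_weights, toy_output = wide_attention_sketch(toy_map)
    print(toy_weights)
    print(toy_output)
```

Rows and columns with larger pooled responses receive gates closer to 1, so pixels lying at the intersection of a strong row and a strong column are suppressed the least.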
Given an input feature map of dimensions
MF refers to the process of sampling input signals at different granularities to obtain distinct features and then integrating them using a specific method to extract more information. In this paper, the MF designed for ResNet involves short-circuiting four layers of ResNet. This approach allows the lower layers to receive all the feature information from the upper layers, thereby achieving the goal of fusing a broader range of facial feature information while maintaining the original channel count and image size of ResNet.
In the original ResNet architecture, the outputs of each layer are partially connected to the intermediate output and serve as input to the subsequent layer through short connections. However, as the hierarchy deepens within ResNet, the size of feature maps gradually decreases. This connection method does not take into consideration the varying sizes of the feature maps extracted at different layers and the crucial facial key points. Instead, it employs a uniform 3 × 3 convolutional kernel size. Consequently, this approach may result in incomplete extraction of important features for certain expressions. Hence, we have devised an MF mechanism that enables all deeper layers to receive information carried by all shallower layers.
As illustrated in Figure 3, the specific approach of MF involves several steps. Firstly, the output feature map of the first layer is concatenated with the input and then subjected to a 1 × 1 convolution, ensuring that the output feature map maintains its original size and channel count. For the second layer, its input undergoes a 7 × 7 kernel convolution to enlarge its receptive field, followed by concatenation with the output of the second layer and convolution, maintaining an output channel count of 128. The output feature map of the second layer is then short-circuited with the intermediate input of the third layer, resulting in an input channel count of 256 for the third layer. In the third layer, the input feature map is convolved with a 5 × 5 kernel, followed by concatenation with the output feature map, keeping the channel count at 256. The output feature map of the third layer is short-circuited with the intermediate output of the fourth layer, resulting in an input channel count of 512 for the fourth layer. Finally, the input of the fourth layer undergoes a 3 × 3 kernel convolution, followed by concatenation and convolution with the output feature map, ensuring that the final output feature map maintains a channel count of 512.
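The size and channel bookkeeping behind these steps can be checked with a short sketch. The kernel sizes (1, 7, 5, 3) and the channel counts 128, 256, and 512 come from the description above; the 64 channels of the first layer are the standard ResNet-18 width and are assumed here. The sketch further assumes stride 1 and padding of (k - 1) / 2 on the fusion branch, and that the branch carries as many channels as the layer output before concatenation. The helper names are hypothetical.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Padding that preserves spatial size for an odd kernel at stride 1."""
    return (kernel - 1) // 2


# (layer index, fusion kernel size, output channels).
# The 64 for layer 1 is an assumption (standard ResNet-18 width).
STAGES = [(1, 1, 64), (2, 7, 128), (3, 5, 256), (4, 3, 512)]


def fusion_plan(feature_size):
    """List padding, preserved size, and channel counts for each layer."""
    plan = []
    for stage, kernel, out_channels in STAGES:
        padding = same_padding(kernel)
        plan.append({
            "stage": stage,
            "kernel": kernel,
            "padding": padding,
            # With "same" padding the fusion branch keeps the map size.
            "size_out": conv_out_size(feature_size, kernel, 1, padding),
            # Assumption: branch and layer output have equal channel counts,
            # so concatenation doubles them before the reducing convolution.
            "concat_channels": 2 * out_channels,
            "fused_channels": out_channels,
        })
    return plan


if __name__ == "__main__":
    for row in fusion_plan(feature_size=28):  # 28 is a hypothetical map size
        print(row)
```

Under these assumptions the concatenated width after the second layer is 256 and after the third is 512, which agrees with the input channel counts stated for the third and fourth layers.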

Structure of the layer after adding the MF mechanism.
Given an input feature map at layer l with dimensions
Experimental environment
All experiments are based on Python 3.7 and CUDA version 11.0. The network is built using PyTorch 1.7.1 for training and testing on Windows 11 operating system. The hardware conditions used are an Intel i9-12900H CPU; 16GB of memory; and an NVIDIA RTX 3070 graphics card with a memory size of 8 GB.
Dataset
The datasets used in this paper are the Jaffe (Lyons et al., 1998) and CK+ (Lucey et al., 2010), which are collected in a laboratory environment. The Jaffe dataset contains 213 images of facial expressions of 7 emotional states performed by 10 Japanese female students: anger (30 images), disgust (29 images), fear (32 images), happiness (31 images), neutral (30 images), sadness (31 images), and surprise (30 images). Figure 4 shows the sample images of expressions from the Jaffe dataset, and Figure 5 shows the data distribution of the seven expressions from Jaffe. The CK+ dataset is obtained by expanding the Cohn-Kanade dataset, and it includes 123 participants, 69% of whom are women, 81% are European-American, 13% are African-American, and 6% are from other ethnic groups. The CK+ dataset consists of 593 image sequences contributed by the participants, and 327 of these sequences are selected with labeled information. We selected three peak frames from the labeled sequences and ended up with 981 images of facial expressions, containing seven emotions: anger (135 images), contempt (54 images), disgust (177 images), fear (75 images), happiness (207 images), sadness (84 images), and surprise (249 images). Figure 6 shows the sample images of expressions from the CK+ dataset, and Figure 7 shows the data distribution of the seven expressions from CK+.
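The per-class counts quoted above can be cross-checked against the stated totals with a few lines of arithmetic. The sketch below only restates numbers from the text; the `summarize` helper and the imbalance ratio it reports are additions made for illustration.

```python
# Per-class image counts as stated in the text.
JAFFE = {
    "anger": 30, "disgust": 29, "fear": 32, "happiness": 31,
    "neutral": 30, "sadness": 31, "surprise": 30,
}
CK_PLUS = {
    "anger": 135, "contempt": 54, "disgust": 177, "fear": 75,
    "happiness": 207, "sadness": 84, "surprise": 249,
}


def summarize(counts):
    """Return the total, the extreme classes, and the largest/smallest ratio."""
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return {
        "total": sum(counts.values()),
        "largest": largest,
        "smallest": smallest,
        "imbalance_ratio": counts[largest] / counts[smallest],
    }


if __name__ == "__main__":
    print(summarize(JAFFE))
    print(summarize(CK_PLUS))
    # Three peak frames per labeled sequence: 327 sequences -> 981 images.
    print(3 * 327 == sum(CK_PLUS.values()))
```

The totals come out to 213 for Jaffe and 981 for CK+, matching the text. The counts also show that Jaffe is nearly balanced, while the largest CK+ class (surprise) has more than four times as many images as the smallest (contempt).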

Example of Jaffe dataset.

Jaffe data distribution.

Example of CK+ dataset.

CK+ data distribution.
We divided CK+ and Jaffe into training and test sets in a ratio of 7:3. To reduce the interference of non-face areas, we cropped the center region of each image to extract the face, which benefits feature extraction and emotion analysis. Before training, the data were normalized in a uniform way to facilitate gradient propagation and faster model convergence. In the experiments, the learning rate was set to 0.001 and the batch size to 16. Each experiment was trained for 200 epochs.
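A 7:3 split and the stated hyperparameters could be set up along the following lines. The text does not say whether the split is stratified by class or which random seed is used, so the per-class stratification, the seed, and the names `CONFIG`, `stratified_split`, and `batches_per_epoch` are all assumptions of this sketch.

```python
import random

# Hyperparameters as stated in the text.
CONFIG = {
    "learning_rate": 0.001,
    "batch_size": 16,
    "epochs": 200,
    "train_ratio": 0.7,
}


def stratified_split(labels, train_ratio=0.7, seed=0):
    """Split sample indices into train/test while keeping class proportions.

    Assumption: the split is done per class; the seed is arbitrary.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def batches_per_epoch(n_train, batch_size):
    """Number of mini-batches per epoch (ceiling division)."""
    return -(-n_train // batch_size)


if __name__ == "__main__":
    toy_labels = ["happy"] * 10 + ["sad"] * 20  # hypothetical labels
    train_idx, test_idx = stratified_split(toy_labels, CONFIG["train_ratio"])
    print(len(train_idx), len(test_idx))
    print(batches_per_epoch(len(train_idx), CONFIG["batch_size"]))
```

Splitting per class keeps rare expressions such as contempt represented in both sets, which matters for the imbalanced CK+ data.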
The evaluation metrics for this experiment are standardized measures used in image classification, including
In this section, we first conducted a comparison experiment and obtained an accuracy of 98.98% for WAMF on the CK+ dataset and 98.46% on the Jaffe dataset. Figures 8 and 9 show the final accuracy curves and loss curves of WAMF on CK+ and Jaffe, respectively. We can see that WAMF has excellent fitting ability and fast convergence speed: the accuracy increases with the number of epochs, while the loss decreases as the number of epochs increases. No overfitting is observed, which supports the conclusion that the network designed in this study has a strong generalization ability for facial expression recognition tasks. Second, we conducted an accuracy comparison experiment to compare the performance of WAMF with methods proposed in other papers on the CK+ and Jaffe datasets. The experimental results show that the WAMF method performs better on these two datasets than the previous methods.

(a) Accuracy on CK+, (b) loss on CK+.

(a) Accuracy on Jaffe, (b) loss on Jaffe.
To prove the performance of WAMF, we conducted a comparative analysis with several existing networks. The classification results, presented in Tables 1 and 2, reveal the efficacy of different networks on the CK+ and Jaffe datasets, respectively.
Accuracy comparison with other models on the CK+ dataset.
DCNN: deep convolution neural networks; DeRL: de expression residue learning; FERAtt: facial expression recognition with attention net; NTF-LRS: nonnegative tensor factorization method is proposed based on low-rank subspace; OAENet: oriented attention enable network; SCAN: spatio-channel attention net.
Accuracy comparison with other models on the Jaffe dataset.
ARLCP: adaptive robust local complete pattern; C-G-ECA-R34: CBAM-global-efficient channel attention-ResNet34; COMP-GAN-CLS: compositional generative adversarial network classifier; DCNN: deep convolution neural networks; HDG: histogram of directional gradient; HDGG: histogram of directional gradient generalized; MFF-CNN: multi-feature fusion based convolutional neural network; SCAN: spatio-channel attention net; SCNN: shallow convolutional neural network; WMDCNN: weighted mixture deep convolution neural networks.
Table 1 illustrates the accuracy improvement of WAMF compared to other networks on the CK+ dataset, ranging from 0.30% to 3.87%. Notably, our proposed WA mechanism exhibits superior efficacy in key feature extraction in contrast to deep convolution neural networks (Mohan et al., 2021), which employ traditional feature extraction methods combined with deep learning for recognition. Additionally, the use of deep learning enables WAMF to extract shallow features more effectively. Rao et al. (2021) proposed a multi-scale graph CNN for facial expression recognition. The main approach of this network is to fuse the multi-scale structural graph generated by segmentation with the original graph for facial expression classification. However, this approach can only extract 98 facial features and may overlook smaller facial expression features. The WA mechanism can attend to both the background and the texture information of the image and therefore has better facial feature extraction capability.
Table 2 demonstrates that the accuracy of WAMF in the Jaffe dataset surpasses that of other networks. Avani et al. (2021) divided the face region into four separate parts for feature extraction. In contrast, the MF mechanism based on ResNet, which we proposed in this paper, enables the fusion of features extracted at various scales. As a result, the information on individual feature components of facial expressions can be interpreted appropriately, thus improving the overall recognition accuracy and generalization ability of the model. Our research shows that employing ResNet18 as the backbone network yields optimal performance.
In this section, two sets of ablation experiments are designed to demonstrate the effectiveness of the proposed WA and MF mechanisms, one concerning the choice of backbone network and the other the decomposition of WAMF into its components, and ultimately to show that the full WAMF configuration is optimal. The experiments use the CK+ and Jaffe datasets with both ResNet18 and ResNet34 as backbones. After adding WA and MF separately, we compare the effectiveness of ResNet18 and ResNet34.
Tables 3 and 4 show the accuracy comparison of ResNet18 and ResNet34 on the CK+ dataset, and Tables 5 and 6 show the accuracy comparison of ResNet18 and ResNet34 on the Jaffe dataset. Table 3 shows that when ResNet18 is used as the backbone network, a maximum accuracy of 98.98% can be achieved on the CK+ dataset, whereas Table 4 indicates that when ResNet34 is used as the backbone network, the maximum accuracy on CK+ is 98.06%, a decrease of 0.92 percentage points compared with ResNet18. Similarly, Table 5 shows that when ResNet18 is used as the backbone network, 98.46% accuracy can be achieved on the Jaffe dataset, whereas Table 6 shows that when ResNet34 is used, only 96.92% recognition accuracy can be achieved, which is 1.54 percentage points lower than with ResNet18.
Accuracy of different models in the CK+ dataset under ResNet18.
Accuracy of different models in the CK+ dataset under ResNet34.
Accuracy of different models in the Jaffe dataset under ResNet18.
Accuracy of different models in the Jaffe dataset under ResNet34.
Figures 10 and 11 show the accuracy graphs we obtained using ResNet18 and ResNet34 on the CK+ and Jaffe datasets, respectively. Figure 10 shows that on the CK+ dataset, ResNet34 achieves higher accuracy than ResNet18 in two configurations, namely without any improvement and with only the MF mechanism, but ResNet18 achieves better results when WA is added and when the full WAMF method is used. Figure 11 shows that on the Jaffe dataset, ResNet18 is superior to ResNet34 in all four configurations: without any improvement, with the MF mechanism, with WA, and with the WAMF method. We therefore choose ResNet18 as the backbone network.

Accuracy of the two networks on CK+.

Accuracy of the two networks on Jaffe.

Confusion matrix for ablation experiments on the CK+ dataset. (a) Nothing, (b) MF, (c) WA, (d) WAMF.

Confusion matrix for ablation experiments on the Jaffe dataset. (a) Nothing, (b) MF, (c) WA, (d) WAMF.
Figure 12(a) shows the confusion matrix obtained for ResNet18 and Capsule Net without adding any improvement mechanism, where 24 expressions were incorrectly identified. After adding the MF mechanism, the number of errors is reduced to 19, as shown in Figure 12(b). The WA mechanism further improves the performance, with six expressions incorrectly recognized, as can be seen in Figure 12(c). The performance peaks when the WA and MF mechanisms are combined, with only three errors, as shown in Figure 12(d).
Figure 13 shows the confusion matrix obtained from our ablation experiments on the Jaffe dataset. As shown in Figure 13(a), a total of five expressions are incorrectly identified when using only the combination of ResNet18 and Capsule Net. Figure 13(b) shows the results after adding MF, where one expression classified as “disgust” is correctly recognized compared to the network without any improvement. Figure 13(c) shows the results after adding WA, where there is a large improvement, with two expressions incorrectly classified. Figure 13(d) shows the results after using WAMF, where the best performance is achieved with only one misclassified expression.
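The error counts read from the confusion matrices can be related to the reported accuracies with a small calculation. The test-set sizes used below are assumptions, since the text gives only the 7:3 ratio: 294 is 30% of 981 rounded to the nearest integer, and 65 is a hypothetical Jaffe test size chosen because it reproduces the reported 98.46% with a single error.

```python
def accuracy_from_errors(n_test, n_errors):
    """Accuracy in percent given a test-set size and a misclassification count."""
    return 100.0 * (n_test - n_errors) / n_test


# Assumed test-set sizes (not stated in the text).
CK_PLUS_TEST_SIZE = 294
JAFFE_TEST_SIZE = 65

# Error counts stated for the ablation configurations.
CK_PLUS_ERRORS = {"baseline": 24, "MF": 19, "WA": 6, "WAMF": 3}
JAFFE_ERRORS = {"baseline": 5, "WA": 2, "WAMF": 1}

if __name__ == "__main__":
    for name, errors in CK_PLUS_ERRORS.items():
        print("CK+", name, round(accuracy_from_errors(CK_PLUS_TEST_SIZE, errors), 2))
    for name, errors in JAFFE_ERRORS.items():
        print("Jaffe", name, round(accuracy_from_errors(JAFFE_TEST_SIZE, errors), 2))
```

Under these assumed sizes, three errors on CK+ and one error on Jaffe correspond to 98.98% and 98.46%, respectively, which is consistent with the reported WAMF accuracies.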
Tables 7 and 8 show the precision, recall, and F1 obtained from WAMF experiments on the CK+ and Jaffe datasets, respectively, and it can be seen that WAMF achieves excellent results on the CK+ and Jaffe datasets. For the CK+ dataset, only three “sadness” expressions were misidentified as “contempt,” while only one “fear” expression was misidentified as “sadness” on Jaffe.
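Precision, recall, and F1 can be derived per class from a confusion matrix. The sketch below uses the usual definitions, assuming that rows hold the true class and columns the predicted class; the 2 × 2 matrix is a hypothetical toy example, not data from the experiments.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) for each class.

    Assumption: confusion[i][j] counts samples of true class i that were
    predicted as class j.
    """
    n_classes = len(confusion)
    results = []
    for k in range(n_classes):
        true_pos = confusion[k][k]
        false_pos = sum(confusion[i][k] for i in range(n_classes)) - true_pos
        false_neg = sum(confusion[k]) - true_pos
        predicted = true_pos + false_pos
        actual = true_pos + false_neg
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


if __name__ == "__main__":
    toy_confusion = [[8, 2], [0, 10]]  # hypothetical two-class example
    for class_index, scores in enumerate(per_class_metrics(toy_confusion)):
        print(class_index, [round(s, 4) for s in scores])
    print(round(accuracy(toy_confusion), 4))
```

In the toy matrix, the two samples of class 0 that are predicted as class 1 lower the recall of class 0 and the precision of class 1, which is the same pattern as a few expressions of one emotion being misidentified as another.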
Precision, recall, and F1 under the CK+ dataset.
Precision, recall, and F1 under the Jaffe dataset.
This paper presents a facial expression recognition network based on the WA and MF mechanisms. The wide and comprehensive attention module facilitates the extraction of background and texture information, while the MF mechanism enables the fusion and interaction of deep and shallow features, leading to a more comprehensive understanding of facial expression features. On publicly available facial expression datasets, WAMF achieves 98.98% accuracy on CK+ and 98.46% on Jaffe. Although WAMF achieves strong results in facial expression recognition, its network structure is relatively complex, with more parameters and a comparatively bulky overall design, so making the network lightweight is the direction of our future research.
Footnotes
Funding
This work is supported by the Education Department of Shaanxi Provincial Government Service Local Special Scientific Research Plan Project under grant number (No. 22JC037).
Conflict of interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Correction (May 2025):
Since the original publication, 2 references “Michael et al. 1998” and “Lucey et al. 2010” have been included in the article.
