Abstract
Facial expression recognition is an active research topic with applications in computer vision fields such as human-computer interaction and affective computing. A lack of diverse and class-discriminative information in the network input can degrade performance, resulting in insufficient extraction of facial expression features. To address these problems, this paper proposes a lightweight deep convolutional neural network (DNN) with a convolutional block attention module (CBAM). The lightweight DNN is built from depthwise separable convolutions and residual blocks. Combining the convolutional block attention module with an improved classification function further optimizes the lightweight model. We evaluate different models using accuracy and confusion matrices, and achieve 71.5% and 99.5% accuracy on the Fer2013 and CK+ datasets, respectively. The experimental results show that our model has good feature representation capability.
Introduction
People infer the emotional states of others, such as joy, sadness and anger, from facial expressions and tone of voice [1]. Among the many non-verbal components, facial expressions are one of the main information channels of interpersonal communication and carry emotional meaning. As a result, the study of facial emotion has received extensive attention over the past few decades, not only in perception and cognitive science but also in affective computing and computer animation [2]. With the development of artificial intelligence, human-computer interaction has further stimulated research on facial expression recognition. A general facial expression recognition system consists of three steps, namely face detection [3], expression feature extraction and facial expression classification, as shown in Fig. 1. Since face detection and classification algorithms are relatively mature, current work focuses on the feature extraction of facial expressions.

Facial expression recognition system.
Feature extraction has long been a difficult problem in traditional machine learning research [4]. Most traditional approaches extract features manually [5–7] and then use the manually extracted features to train a shallow classifier that classifies the expressions. These methods have made important contributions to the study of facial expression recognition. However, manual feature extraction is subject to human interference, so the resulting models generalise too poorly to achieve high recognition accuracy. Given the size and diversity of the datasets, deep convolutional neural networks (DCNN) are the most suitable technique for image classification among all classification methods [8]. Deep convolutional neural networks can automatically extract more abstract high-level features or attribute features, thus improving the final classification or prediction accuracy [9, 10].
In summary, this paper proposes a deep convolutional neural network based on an attention mechanism for facial expression recognition. The main contributions of this paper are twofold. First, a new lightweight convolutional neural network is designed that includes depthwise separable convolutions and convolutional residual blocks. Using depthwise separable convolutions instead of standard convolutional layers greatly reduces the number of parameters in the model, while the residual blocks help to avoid vanishing gradients and the loss of accuracy that occurs when a network becomes too large and too deep. Second, the attention mechanism is used in combination with an improved loss function to further improve the feature expression ability of the network model. Experimental comparison with different models shows that our method can fully extract facial expression features and effectively improve the expression recognition rate.
The remainder of the paper is organized as follows. Section 2 describes the related work and the main techniques used in this paper; Section 3 describes the proposed model in detail; Section 4 describes the datasets we used; Section 5 discusses the experimental setup and results; and Section 6 summarises the work.
DCNNs have been used with considerable success for image classification tasks. For example, Simonyan et al. [11] proposed the visual geometry group network (VGGNet), which uses only small (3×3) convolutional filters in all layers, and proved its effectiveness. However, blindly deepening a network may introduce new problems such as overfitting and difficult training, so Szegedy et al. [12] proposed a new neural network architecture called Inception, which converts the network into a sparse structure and introduces depthwise separable convolutions. He et al. proposed the residual network (ResNet) [13], which adds shortcuts between layers to create skip connections. These networks have inspired our network design.
In recent years, researchers have applied DCNNs to facial expression recognition with good results. Jain et al. combined convolutional neural networks and recurrent neural networks (RNN) [14] for facial expression recognition [15]. Saurav et al. proposed an emotion network (EmNet) architecture for real-time automatic recognition of facial expressions in natural environments, which achieved an accuracy of 87.16% on the RAF-DB dataset [16], a significant improvement in accuracy over the state of the art at the time [17]. Mohammed et al. proposed a new facial expression recognition technique that combines convolutional neural networks with the k-nearest neighbor (KNN) [18] algorithm from machine learning [19].
In addition, combining an attention mechanism [20] with image processing can further enhance the performance of the network. The attention mechanism allows the neural network to ignore irrelevant information and focus on valid information. Jaderberg et al. proposed a spatial transformer module that uses the attention mechanism to transform the spatial domain information of the image into the corresponding space, thereby extracting the region of interest in the image [21]. Hu et al. proposed the squeeze-and-excitation (SE) block [22], which is used to select the feature maps relevant for object recognition. However, the SE block considers only the role of channels in image classification and ignores the spatial position of features in the image. The spatial transformer network has a similar limitation: it ignores the influence of channels in feature maps. To this end, Woo et al. proposed the convolutional block attention module (CBAM) [23], an attention module that combines spatial and channel attention, unlike SE blocks, which focus only on channels. For facial expression recognition, we may want to focus on the corners of the mouth and the eye area rather than other areas, and adding an attention module to the model is a suitable way to achieve this.
Depthwise separable convolution
Depthwise separable convolution is a special form of convolution, first proposed by Howard et al. in 2017 and applied to the MobileNet [24] network. It decomposes the standard convolution into a depthwise convolution and a pointwise (1×1) convolution. The depthwise convolution applies a separate filter to each input channel, and the pointwise convolution then blends the resulting channels. Decomposing the convolution in this way can greatly reduce the computational cost.
Suppose a feature map of size $D_F \times D_F \times M$ is the input to a standard convolution and the output feature map has size $D_G \times D_G \times N$, where $D_F$ is the spatial width and height of the input feature map, $M$ is the input depth, $D_G$ is the spatial width and height of the output feature map, and $N$ is the output depth. With a convolution kernel of size $D_K \times D_K$, and assuming stride 1 with padding so that $D_G = D_F$ as in MobileNet [24], the computational cost $C_{conv}$ of a standard convolutional layer is given by Equation (1).
\[ C_{conv} = D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \tag{1} \]
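The computational cost of a depthwise separable convolution, $C_{dsconv}$, is simply the sum of the costs of the depthwise convolution and the pointwise convolution, as shown in Equation (2).
\[ C_{dsconv} = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \tag{2} \]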
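By splitting a standard convolution into a depthwise convolution followed by a pointwise convolution, the amount of computation can be greatly reduced. The two costs compare as follows:
\[ \frac{C_{dsconv}}{C_{conv}} = \frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} \tag{3} \]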
When we use a depthwise separable convolution with a 3×3 kernel, the computation is 8 to 9 times lower than that of a standard convolution with a kernel of the same size, with little loss in accuracy.
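The cost comparison above can be checked numerically. The following is a minimal sketch, assuming stride 1 with padding (so that the output has the same spatial size as the input) and counting multiply-add operations only; the function names are illustrative and not part of the original work.

```python
def conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution (stride 1, padded)."""
    return d_k * d_k * m * n * d_f * d_f


def dsconv_cost(d_k, m, n, d_f):
    """Multiply-adds of a depthwise separable convolution."""
    depthwise = d_k * d_k * m * d_f * d_f   # one d_k x d_k filter per input channel
    pointwise = m * n * d_f * d_f           # 1x1 convolution mixing the channels
    return depthwise + pointwise


def cost_ratio(d_k, n):
    """Closed-form ratio C_dsconv / C_conv = 1/N + 1/D_K^2."""
    return 1.0 / n + 1.0 / (d_k * d_k)


if __name__ == "__main__":
    # 48x48 feature maps, 3x3 kernels, equal input and output depth.
    for n in (32, 64, 128, 256):
        saving = conv_cost(3, n, n, 48) / dsconv_cost(3, n, n, 48)
        print(f"N={n:3d}: standard / separable = {saving:.2f}x")
```

For a 3×3 kernel the saving is bounded above by 9 and approaches that bound as the output depth N grows, which is consistent with the 8 to 9 times figure for the wider layers.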
Theoretically, adding new layers to a neural network should reduce the training error after sufficient training, because the solution set of the original model is only a subset of the solution set of the new model: the new model only needs to learn an identity mapping f(x) = x in the newly added layers [13] to perform as well as the original model. In practice, however, very deep networks are harder to optimize, and the training error can increase as layers are added. To address this problem, He et al. proposed the residual network (ResNet) [13].
A skip connection structure is used in the residual network. Skip connections take the activations from one layer of the neural network and pass them around the intermediate layers to deeper layers of the network. The structure of the residual block is shown in Fig. 2. Assume that the input to the hidden layer is x, the output of the hidden layer is f(x), and the activation function is Activation; then the final output $y$ of the residual block is given by Equation (4).
\[ y = \mathrm{Activation}(f(x) + x) \tag{4} \]

The structure of residual block.
Figure 2(a) shows the identity residual block, which is the standard residual block for the case where the input and output dimensions are the same. Figure 2(b) shows another type of residual block, called a convolutional residual block, which is mainly used when the input and output sizes differ.
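The two kinds of residual block described above can be sketched on plain vectors. This is a minimal illustration, not the authors' implementation: `f` stands in for the hidden layers, and the optional `shortcut` callable is a hypothetical placeholder for the 1×1 convolution on the skip path of the convolutional residual block.

```python
def relu(v):
    """Element-wise rectified linear unit on a list of floats."""
    return [max(0.0, t) for t in v]


def residual_block(x, f, activation=relu, shortcut=None):
    """Return Activation(f(x) + shortcut(x)).

    shortcut=None gives the identity residual block; passing a callable
    (for example a 1x1 convolution) gives the convolutional residual block
    used when f changes the size of x.
    """
    fx = f(x)
    skip = x if shortcut is None else shortcut(x)
    if len(fx) != len(skip):
        raise ValueError("f(x) and the shortcut must have the same size")
    return activation([a + b for a, b in zip(fx, skip)])
```

If `f` outputs zeros, the identity block reduces to `activation(x)`, which is why adding such a block should not make the network harder to fit than the shallower one.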
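Attention mechanisms allow neural networks to ignore irrelevant information and focus on valid information. The convolutional block attention module (CBAM) [23] is a lightweight and general attention module. Figure 3 shows the structure of the CBAM. In CBAM, the input intermediate feature map is $F \in \mathbb{R}^{C \times H \times W}$, in which C represents the number of channels, H the height of the feature map, and W the width of the feature map. The channel attention module and the spatial attention module in CBAM process $F$ in sequence as follows:
\[ F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \]
where $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is the channel attention map, $M_s \in \mathbb{R}^{1 \times H \times W}$ is the spatial attention map, $\otimes$ denotes element-wise multiplication, and $F''$ is the refined output [23].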

The structure of CBAM.
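The sequential channel and spatial attention of CBAM can be sketched in plain Python on nested lists. This is a simplified illustration following the description in [23], not the code used in this paper: the shared two-layer perceptron has no bias terms, the spatial convolution uses zero padding, and the weight arguments `w0`, `w1` and `kernel` are placeholders that would normally be learned.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(fmap, w0, w1):
    """fmap: C channels of H x W values. w0: (C/r) x C, w1: C x (C/r)."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(map(max, ch)) for ch in fmap]

    def mlp(v):  # shared MLP with a ReLU hidden layer
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w0]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(fmap, kernel):
    """kernel: 2 x k x k weights applied to the [average, max] channel pools."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    avg = [[sum(fmap[n][i][j] for n in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(fmap[n][i][j] for n in range(c)) for j in range(w)] for i in range(h)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for m, pooled in enumerate((avg, mx)):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:  # zero padding outside the map
                            s += kernel[m][di][dj] * pooled[y][x]
            out[i][j] = sigmoid(s)
    return out


def cbam(fmap, w0, w1, kernel):
    """Refine fmap with channel attention first, then spatial attention."""
    mc = channel_attention(fmap, w0, w1)
    f1 = [[[mc[n] * v for v in row] for row in ch] for n, ch in enumerate(fmap)]
    ms = spatial_attention(f1, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)] for ch in f1]
```

With all weights set to zero, both attention maps equal 0.5 everywhere, so the output is the input scaled by 0.25; this gives a simple sanity check before plugging in learned weights.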
Both depthwise separable convolutions and residual blocks, which are designed to reduce the number of network parameters, widen the receptive field and extract more accurate image features, have greatly inspired our design. CBAM further helps the network focus on useful image regions, effectively improving the performance of facial expression recognition.
In the following subsections, the details of the proposed model are presented.
The proposed DNN
Our proposed DNN model follows the structural idea of VGGNet [11]. All convolutional layers in VGGNet use 3×3 kernels, and the core of the network is a stack of convolutional blocks. The number of channels in each convolutional block is double that of the previous block, and the first convolutional block has 64 channels. To avoid the overfitting caused by the complexity of the network, we first replace the convolution layers in the core module with a combination of depthwise separable convolutions and residual blocks. The number of parameters and the computational cost of depthwise separable convolutions are low compared with those of traditional convolutions. The fully-connected layer in the output section is then replaced by global average pooling, and finally the number of channels in the first convolutional block is reduced to 32.
In summary, the proposed DNN model consists of three modules: an input module, an intermediate module and an output module, as shown in Fig. 4, where different graphic styles represent different structures. The input module passes the image to the intermediate module through two 3×3 convolution layers. The intermediate module is formed by stacking convolutional residual blocks, each composed of a depthwise convolution block and a 1×1 convolution block. The depthwise convolution block consists of two depthwise separable convolution layers and a max pooling layer, and the numbers of channels of the depthwise separable convolution layers are 32, 64, 128 and 256, respectively. Finally, the output module, composed of a 3×3 convolution layer and an average pooling layer, produces the output. Each convolution layer and depthwise separable convolution layer is followed by batch normalization (BN) [25] and a rectified linear unit (ReLU) [26] activation. The BN layer normalizes the data coming from the previous layer within each training batch to zero mean and unit variance. Applying BN not only speeds up convergence, and therefore training, but also improves the generalization ability and training accuracy of the model. The use of depthwise separable convolution layers in the DNN greatly reduces the number of parameters of the model.

The structure of DNN.
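To give a rough sense of the parameter saving in the intermediate module, the sketch below counts convolution weights for the four block widths 32, 64, 128 and 256, with two 3×3 layers per block. Several details are assumptions made only for illustration and are not stated in the paper: the input to the first block is taken to have 32 channels, the first layer of each block changes the width while the second keeps it, and biases, batch normalization parameters and the 1×1 shortcut convolutions are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dsconv_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1x1 pointwise one."""
    return k * k * c_in + c_in * c_out


def middle_module_params(widths=(32, 64, 128, 256), c_in=32, k=3):
    """Sum the weights of two stacked layers per block, for both layer types."""
    standard = separable = 0
    for c_out in widths:
        for layer_in in (c_in, c_out):   # first layer changes width, second keeps it
            standard += conv_params(k, layer_in, c_out)
            separable += dsconv_params(k, layer_in, c_out)
        c_in = c_out
    return standard, separable


if __name__ == "__main__":
    standard, separable = middle_module_params()
    print(f"standard: {standard}, separable: {separable}, "
          f"ratio: {standard / separable:.2f}x")
```

Under these assumptions the separable layers need roughly one eighth to one ninth of the weights of their standard counterparts, mirroring the saving in computation.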
Details on how to improve the proposed model are presented in the following subsections.
Attention mechanism
We propose a DNN with depthwise separable convolutions and residual modules. However, the lack of diversity and class-discriminative information in the input of the DNN may affect the recognition performance of the network. Therefore, in order to emphasize or select the important input information and suppress irrelevant details, we introduce the convolutional block attention module (CBAM) into the model. CBAM computes attention maps for the given feature map along the channel and spatial dimensions in turn. The input feature map is then multiplied by the attention maps for adaptive feature refinement.
Inserting a CBAM block at several depths among the separable convolution layers inevitably increases the complexity of the network architecture and of the training process. Although CBAM is a lightweight module, overusing it would defeat the lightweight design of the network. Therefore, we tried inserting a single CBAM at different locations in the network, such as the input module or the intermediate module, taking care not to break the integrity of the network. Comparing the converged loss curves with that of the original network model shows that inserting CBAM between the intermediate module and the output module works best, as shown in Fig. 6. Therefore, this paper inserts CBAM between the intermediate module and the output module of the DNN to improve the feature expression ability of the network.

Model loss of CBAM at different positions.
Most image classification tasks use the Softmax classification function for the final output. In facial expression recognition, however, the facial features of the expression classes show large intra-class differences and small inter-class differences, and the traditional Softmax loss function cannot handle this well. Therefore, an improved loss function, additive margin Softmax (AM-Softmax) [27], is introduced into the output module of our model.
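The idea behind AM-Softmax can be sketched for a single sample as follows. Features and class weights are L2-normalized so that the logits are cosines, a margin m is subtracted from the cosine of the true class, and the result is scaled by s before the usual cross-entropy. This is an illustrative sketch based on the formulation in [27], not the training code of this paper; the defaults s = 30 and m = 0.35 are assumptions, since the values used here are not reported.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]


def am_softmax_loss(feature, class_weights, label, s=30.0, m=0.35):
    """AM-Softmax loss for one sample.

    feature: embedding vector; class_weights: one weight vector per class;
    label: index of the true class. s and m are illustrative defaults.
    """
    f = l2_normalize(feature)
    cosines = [sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [s * (c - m) if j == label else s * c for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]
```

Because the margin lowers only the true-class logit, a sample must lie closer to its class centre to reach the same loss as under plain Softmax, which encourages compact classes that are well separated from each other.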
The proposed DNN-CBAM
We call the final, improved model DNN-CBAM. In DNN-CBAM, the overall features of the input data are extracted by the basic network, DNN. The CBAM then processes the feature maps, adaptively assigning weights along the channel and spatial dimensions, and the result is finally passed through a classification function. The basic network, CBAM and the final AM-Softmax classification function are all indispensable: the basic network deals with the global information of the images, while CBAM highlights their key features. Finally, AM-Softmax increases the distance between classes and decreases the distance within classes, further improving the accuracy of facial expression recognition. These three parts complement each other and improve the performance of the network. The structure of DNN-CBAM is shown in Fig. 5.

The structure of our final model: DNN-CBAM.
The datasets used to evaluate the model and the experimental setup are described in detail in this section.
In this paper, we use the Fer2013 and CK+ [28] datasets to evaluate the performance of our model.
The Fer2013 dataset is the official dataset of the 2013 Kaggle facial expression recognition contest. Most of the images were collected by web crawlers and cover different ages, different angles, partial occlusion, and so on. Figure 7 shows a sample of the expressions. Fer2013 contains a total of 35887 grayscale images with a resolution of 48×48, of which 28709 are used for training and 7178 for testing. The dataset contains seven expressions: anger, disgust, fear, joy, sadness, surprise and neutral. The corresponding labels and the amount of data for each expression are shown in Table 1.

Examples of Fer2013 dataset.
Emotion labels in the Fer2013 dataset
The CK+ dataset, an extension of the CK dataset, is an important dataset in the field of facial expression classification. It was recorded with uniform hardware and contains a total of 593 video sequences from 123 subjects. The CK+ dataset divides expressions into seven categories: anger, contempt, disgust, fear, happiness, sadness, and surprise. Unlike the Fer2013 dataset, CK+ includes contempt instead of the neutral class. Figure 8 shows some of the expression data from the CK+ dataset, and the corresponding labels and amount of data for each expression are shown in Table 2.

Examples of CK+ dataset.
Emotion labels in the CK+ dataset
In order to make the most of our limited training data, we augment the data through a series of random transformations so that the model never sees exactly the same image twice during training. Specifically, the images in the dataset are randomly rotated, zoomed, sheared, horizontally flipped and padded with filled pixels, as shown in Fig. 9. These operations expand the dataset and thus improve the accuracy of the predictions. In addition, data augmentation forces the network to learn more robust features, giving the model a stronger generalization ability.

Examples of data enhancement.
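Two of the transformations listed above, horizontal flipping and shifting with pixel filling, can be sketched in plain Python on a grayscale image stored as a list of rows. This is only an illustration of the principle: rotation, zoom and shear are omitted, and the shift range, fill value and flip probability are assumed values rather than the settings used in the experiments.

```python
import random


def horizontal_flip(img):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in img]


def shift_with_fill(img, dy, dx, fill=0):
    """Translate an image by (dy, dx) pixels, filling uncovered pixels with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def augment(img, rng, max_shift=4, flip_prob=0.5):
    """Apply a random flip and a random shift; the output keeps the input size."""
    if rng.random() < flip_prob:
        img = horizontal_flip(img)
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    return shift_with_fill(img, dy, dx)
```

Calling `augment(image, random.Random(seed))` once per image and per epoch yields a different variant each time while keeping the 48×48 input size expected by the network.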
During training, the augmented datasets were divided into 5 groups based on image numbering [29]. In each round of cross-validation, 4 groups are used as the training set and the remaining group as the test set. We randomly initialized the weights and biases, and trained for 120 epochs on Fer2013 with the batch size set to 32 [30]. Considering the small size of the CK+ dataset, we increased the number of training epochs to 300 for better feature extraction [31]. The Adam optimizer [32] is used with a learning rate of 0.001 [33].
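The 5-fold protocol described above can be sketched as follows. How images are assigned to groups is not specified beyond "image numbering", so the index-modulo-k rule used here is an assumption made for illustration.

```python
def five_fold_splits(num_images, k=5):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Images are assigned to groups by their index modulo k, which is one
    possible reading of grouping by image numbering.
    """
    folds = [[i for i in range(num_images) if i % k == g] for g in range(k)]
    for g in range(k):
        test = folds[g]
        train = [i for h in range(k) if h != g for i in folds[h]]
        yield train, test
```

Each image appears in exactly one test fold, so averaging the five test accuracies uses every image once for evaluation.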
The models were evaluated by accuracy and by the confusion matrix. The accuracy is calculated as follows:
\[ \mathrm{Accuracy} = \frac{TP}{TP + FP} \]
where TP is the number of correctly recognised expressions and FP is the number of incorrectly recognised expressions.
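Both evaluation measures can be computed from the true and predicted labels in a few lines. The sketch below is illustrative only; it assumes integer class labels starting at 0 and normalizes each row of the confusion matrix by the number of samples of the true class.

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly recognised samples, TP / (TP + FP)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def normalized_confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalised confusion matrix: entry [i][j] is the share of class-i
    samples predicted as class j. Rows with no samples stay all zero."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

The diagonal of the normalized matrix gives the per-class recognition rate, which makes confusions between expressions such as fear and sadness visible even when the overall accuracy is high.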
The platform and computer configuration used to test the proposed model are shown in Table 3.
Platform and computer configuration used
First, the proposed model is compared with a series of other models on the Fer2013 dataset; the recognition accuracy of the different models is shown in Table 4. The convolutional neural network combined with a support vector machine (CNN-SVM) [35] is the model proposed by Tang, the winner of the 2013 Kaggle facial expression recognition contest. Cao et al. [36] combined CBAM with the classic VGGNet by inserting a CBAM block after each max pooling layer in the network; the resulting model, named VGG-CBAM, improves accuracy by 1.7% over the original VGG network. Without the improved loss function and the attention module, our lightweight model achieves a recognition accuracy of 69.3%, which is close to that of Tang's model and slightly lower than that of the VGG network. We then conduct a comparative experiment on adding CBAM to the basic network. As can be seen from Table 4, after adding CBAM the recognition accuracy reaches 70.5%, which is 1.2% higher than that of the original network. The improved model is slightly lower than the VGG-CBAM network by 1.1% compared with Tang’s. This is because we insert only one CBAM block in order to keep the network lightweight.
Accuracy of different models on the Fer2013 dataset
The experimental results show that CBAM can effectively improve the feature expression ability and the recognition rate of the network. In addition, we compare the models before and after the improvement of the loss function. Compared with the model using the Softmax loss function, the recognition rate of the model using the AM-Softmax loss function improves by 1%, which also outperforms the other models. This shows that the AM-Softmax loss function can effectively improve the ability to discriminate facial expression features. Figure 10 shows the variation in accuracy of the three networks during training on the Fer2013 dataset.

Accuracy curves of different models on the Fer2013.
Finally, we compare the proposed DNN-CBAM model with a range of other models on the CK+ dataset; the recognition accuracy of the different models is shown in Table 5. The table shows that the proposed DNN-CBAM model outperforms many other methods. Figure 11 shows the variation in accuracy of DNN-CBAM during training on the CK+ dataset.
Accuracy of different models on the CK+ dataset

Accuracy curves of DNN-CBAM on the CK+.
Tables 6 and 7 show the normalized confusion matrices of the proposed DNN-CBAM model on the two datasets. On the Fer2013 dataset, high accuracy was obtained for disgust, happiness, surprise and neutral. These are the easiest human emotions to discern, while anger, fear and sadness are more easily confused with one another. On the CK+ dataset, each expression is classified with near-perfect accuracy. The reason for this large difference between the two datasets is that the expression images in the Fer2013 dataset were crawled from the internet. The dataset contains many occluded, non-frontal or even non-expressive images, and these interfering images prevent the network from extracting recognizable features. The CK+ dataset, in contrast, consists of video recordings of experimental subjects in the laboratory, and each sequence is labelled according to the Facial Action Coding System. These expressions are representative and universal, so the network can adequately extract the features of the different expression classes. Figure 12 shows some of the negative images in the Fer2013 dataset.
Confusion matrix of DNN-CBAM on the Fer2013
Confusion matrix of DNN-CBAM on the CK+

Examples of negative images.
In this paper, we propose a deep convolutional neural network model that combines depthwise separable convolutions and residual blocks. We then introduce an attention mechanism and a new loss function into the model. The attention mechanism highlights the key features of the images, which allows the neural network to focus more on useful features. The improved loss function, AM-Softmax, is used to minimize the intra-class differences within facial expression categories. Experimental results show that the proposed method achieves good performance in facial expression recognition. In future work, we will focus on training the network with more data, more filters and more depth to improve the accuracy of facial expression recognition.
Acknowledgment
This research was supported by the National Natural Science Foundation of China (Grant No. 61871234), “Study of Mechanism and Classification for EEG Emotional Brain network in Teaching Situation”.
Appendix
The abbreviations for all terms used in the paper are shown in Table 8.
