A classroom facial expression recognition method based on attention mechanism

Abstract

Compared with general facial expression recognition, classroom facial expression recognition should pay more attention to feature extraction from specific regions in order to reflect the attention of students. However, most methods extract features from the complete facial image with deep neural networks. In this paper, we propose a new expression recognition method based on an attention mechanism, in which more attention is paid to the channel information that is strongly related to expression classification instead of depending equally on all channel information. A new classification of classroom expressions that takes concentration into account is also defined. Moreover, the activation function is modified to reduce the number of parameters and computations, and dropout regularization is added after the pooling layer to prevent overfitting of the model. The experiments show that the accuracy of our method, named Ixception, is up to 5.25% higher than that of other algorithms. It can meet the requirements of classroom concentration analysis well.

Keywords

Deep learning, classroom facial expression recognition, attention mechanism, activation function, dropout regularization

1 Introduction

Classroom facial expression is an important way to reflect the listening state of students. However, the division of students’ emotions while in a classroom is the key factor in judging the state of listening [1, 2]. Researchers have analyzed different situations based on different definitions of emotion. The basic emotions proposed by Ekman have been widely recognized and studied by the academic community, but these emotions are somewhat different from those emphasized in learning situations [3, 4]. Thus, a new classification of emotional states needs to be defined that takes the learning situation into account.

Most existing facial expression methods are based on the six basic emotions of psychologist Ekman [5], namely happiness, anger, disgust, sadness, fear, and surprise. Ekman proposed a facial action unit (AU) coding system (FACS). The facial action unit includes six upper facial muscle features and nine lower facial features [6]. FACS has been widely used for emotion recognition and has become a very effective tool for emotion analysis. But not all of the emotions Ekman described are applicable to the classroom. At present, there is no unified concept for the division of students’ emotions while in a classroom. Mello [7] of the University of Memphis monitored the mood changes in students who were learning basic computer skills under the Auto Tutor system. It was found that the six emotional factors mentioned by Ekman were not completely applicable to the analysis of students’ classroom states. Happiness, surprise, boredom, confusion, and frustration occurred more frequently in a classroom, with happiness and surprise occurring more frequently.

The method of detection is another issue to be considered. Direct observation is a straightforward way for a teacher to know the learning state of students, but it is limited by the number of students, classroom size, and other factors. Therefore, it is particularly important to use artificial intelligence technology to develop an efficient method for detecting students’ emotional states while learning in a classroom environment. Since face detection is a relatively mature technology, it can be used to analyze the features of students’ facial expressions so as to accurately explore their emotional states and to effectively detect their current learning states.

Chen et al. [8] proposed an intelligent human-computer interaction system based on the online classroom, in which a position algorithm was used to estimate head posture, Haar-like features were used to recognize facial expressions, and multi-modal information was adopted to identify learners’ emotional states. Based on the displacement of facial feature points and local texture differences, a support vector machine was used to recognize facial expressions. However, support vector machines are difficult to apply to large-scale training samples. By identifying students’ classroom behaviors, locating facial feature points, and estimating head pose angles, Huang et al. [9] proposed a deep convolutional neural network (D-CNN) combined with a cascaded facial feature point location method to analyze and recognize head poses and facial expressions. That is, the initial positions of the facial features were estimated and the feature information was extracted together with the head pose. Lee et al. [10] proposed a process-focused assessment (PFA) method in which a deep neural network model was used to learn facial expressions in order to conduct real-time assessments of students’ learning and learning processes. The concept of machine learning based on a DNN model was adopted, and the open library Keras was used to implement the network. Adaboost and Haar-based methods were used to extract facial features. In summary, these convolutional-network methods divided facial expressions into three categories (simple, neutral, and difficult), and their efficiency was low.

Wu et al. [11] designed a program to analyze students’ facial expressions. OpenCV was used for face capture, two different CNN models were integrated to recognize students’ facial expressions, and the EMQX message protocol was used to transmit the facial expression information. Talegaonkar et al. [12] proposed a CNN-based facial expression recognition system that could classify facial expressions in real time through a webcam. The image was preprocessed by grayscale conversion, normalization, and resizing. The Haar cascade classifier was used for face detection, and a convolutional neural network was used to recognize facial expressions. Wang et al. [13] proposed a framework combining a compact CNN-based deep learning model with an online course platform. Students’ facial expressions were analyzed from the perspective of computer simulation. The facial expression recognition (FER) algorithm was used to analyze the students’ facial expressions and to divide them into eight emotions. Pabba et al. [14] proposed a real-time monitoring system for student group participation using convolutional neural networks to recognize students’ facial expressions, which included two modules: offline and online. The offline module was a CNN-based trained FER model, while the online module ran in real time to estimate students’ engagement using the CNN model trained in the offline module. Multiple FER models needed to be trained for different populations. The above methods all rely on standard convolution, in which features are extracted from the complete face image rather than from the regions that should be observed. Moreover, they require a large amount of computation and a large number of parameters.

Different from the above, Li et al. [15] proposed a novel end-to-end automatic facial expression recognition network with an attention mechanism. This method combined local binary pattern (LBP) and convolution features with the attention mechanism to improve the performance of the network. The LBP feature can improve network performance by extracting texture information from images and capturing tiny movements of faces. The attention mechanism can make the neural network pay more attention to useful features. Combining the two enhanced the attention model and gave better results. Minaee et al. [16] gave an overview of modern augmented reality (AR) technology from application-level and technical perspectives, including about 100 promising machine learning-based work systems developed for AR. AR-based education platforms are in high demand after the COVID-19 pandemic, and augmented reality can serve as a great tool in education.

Taking into account the six basic emotions and foreign studies on classroom emotions, this paper defines students’ facial expressions as listening, thinking, understanding, yawning, and mind wandering [17, 18]. This definition is more in line with teaching practice. Based on Xception [19, 20], this paper adopts depthwise separable convolution to reduce the number of model parameters and to improve parameter efficiency [21, 22]. In addition, the rational use of an attention mechanism module plays an important role in improving the recognition of students’ expressions in class [23]. We introduce an attention mechanism module to emphasize informative channels, suppress useless information, and address the loss caused by channels of differing importance being treated equally. We also improve the activation function to raise the recognition accuracy of the network.

The contributions of this paper are as follows:

The Xception convolutional neural network with a high recognition rate was selected as the basic model.

The classification of facial expressions in a classroom is put forward.

The attention mechanism module was added to the model to focus on useful information. The activation function was improved, the number of parameters and amount of calculations of the model were reduced, and dropout regularization was used during training to simplify the network structure and to prevent overfitting of the model.

The accuracy of the proposed method was improved by 5.25% compared to Resnet50 [24], 3.11% compared to Inceptionv3 [25], 2.71% compared to MobileNet v1 [26], and 2.22% compared to Xception. The recall rate and F1 score were also improved. Accuracy also improved on the public datasets FER2013, CK+, AffectNet, and RaFD. The algorithm effectively improved the accuracy of expression recognition.

2 Related works

2.1 Xception

The Xception network, proposed by Google, is an improvement on Inception v3 [27]. It is divided into three parts: the entry flow, the middle flow, and the exit flow. The entry flow is mainly used for continuous downsampling to reduce the spatial dimension. The middle flow repeatedly learns correlations and refines the features. The exit flow summarizes and collates the features, which are then passed to the fully connected layer. Feature extraction is based on 36 convolution layers, which are structured into 14 modules, including 4 for the entry flow, 8 for the middle flow, and 2 for the exit flow. All modules except the first and last have linear residual connections [28] around them. The specific structure is shown in Fig. 1.

Fig. 1

The architecture of Xception.

The Xception network is a convolutional neural network based on depthwise separable convolution. At the same time, the residual structure is introduced to reduce the difficulty of network training, to accelerate convergence of the network, and to alleviate the vanishing gradient problem in deep networks. Depthwise separable convolution can be decomposed into two smaller operations: depthwise convolution and pointwise convolution [29]. Depthwise convolution carries out the convolution operation on each channel separately without changing the depth of the input feature map, so the output feature map has the same number of channels as the input. Pointwise convolution is a 1 × 1 convolution that raises or lowers the channel dimension of the feature map. Depthwise separable convolution has fewer parameters and lower computational cost than conventional convolution [30]. An example of depthwise separable convolution is shown in Fig. 2.

Fig. 2

Depthwise separable convolution.
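The parameter saving described above can be illustrated with a short calculation. The following is a minimal sketch, not the authors' implementation; the kernel size and channel counts are hypothetical values chosen only for illustration, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
ratio = separable / standard      # equals 1 / c_out + 1 / k**2

print(standard, separable, round(ratio, 4))
```

For a 3 × 3 kernel the ratio is close to 1/9 whenever the number of output channels is large, which is why the saving grows with kernel size rather than with network width.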

2.2 Squeeze and excitation block

The main function of the squeeze and excitation block (SE Block) is to assign a weight to each channel and thereby help the network learn important feature information. The block consists of three steps: the squeeze operation, the excitation operation, and the fusion operation. The squeeze operation uses global pooling to compress features along the spatial dimensions, turning each two-dimensional feature channel into a single real number [31].

After the squeeze operation, the network only has a global description of each channel, and this description cannot be used directly as the weight of the channel. Therefore, the excitation operation is applied. It captures channel-wise dependencies and must be able to learn a non-mutually-exclusive relationship, so that several channels can be emphasized at once.

After the excitation operation, the weights of each channel in the input feature map U are obtained, and then the weights are fused with the initial features.

The module structure is shown in Fig. 3. The feature map size is H × W × C. F_sq (·) represents the squeeze operation, F_ex (· , W) represents the excitation operation, and F_scale (· , ·) represents the fusion operation.

Fig. 3

SE block architecture.

2.3 The mish function

An activation function is usually introduced to add nonlinearity to a convolutional neural network. The Mish function [24] is mathematically defined as follows: $f(x) = x \cdot \tanh(\ln(1 + e^{x}))$ (1)

The ReLU activation function [32] suffers from a vanishing gradient when the input is negative, whereas the Mish function is a smooth, non-monotonic activation function, which can make gradient descent smoother and can improve the recognition accuracy at the same time.
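The difference between the two activations for negative inputs can be checked numerically. The following is a minimal sketch of Eq. (1) in plain Python; the helper names and the sample inputs are illustrative assumptions, and the softplus term is written in a numerically stable form.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation from Eq. (1): x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU activation: zero for every negative input."""
    return max(x, 0.0)


# Sample inputs chosen only for illustration.
for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):+.4f}")
```

Unlike ReLU, Mish returns small negative values for negative inputs, so some gradient still flows there, and its value at x = -2 is higher than at x = -1, which reflects the non-monotonic shape.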

3 Classroom facial expression recognition method based on attention mechanism

At present, most mainstream CNNs extract expression features from the whole face image and treat all regions in the image uniformly. They cannot focus on the regional features of the input image that are relevant to classification, which wastes information processing resources and weakens the expressive ability of the features. SE-Net integrates channel attention into the convolutional module, which significantly improves the performance of multiple deep CNNs. To prevent excessive network parameters from affecting network performance, SE blocks were only added to the first three parts of the network. The ReLU activation function was replaced by the Mish activation function, and the linear operation was used to strengthen the feature representation and to reduce the number of parameters and computations of the model.

This paper focuses on the Xception network architecture. A squeeze-and-excitation module was added after each depthwise separable convolution in the entry flow. After the depthwise separable convolution, the feature map was first globally pooled, calculated as follows: $z_{c} = F_{sq}(u_{c}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)$ (2) where $z \in \mathbb{R}^{C}$ is generated by shrinking U through its spatial dimensions H × W.

Next comes the excitation operation, which consists of two fully connected layers and the Sigmoid activation function. The calculation is as follows: $s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_{2} \delta(W_{1} z))$ (3) where z is the global description obtained by the squeeze operation, σ is the Sigmoid function, δ refers to the ReLU function, $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$, and r is the reduction ratio. When r = 16 was set in ResNet-50, accuracy and complexity reached a balance.

After the above operations, the weights of each channel in the input feature map U were obtained. Finally, the weights and original features were fused. The calculation is as follows: $\tilde{x}_{c} = F_{scale}(u_{c}, s_{c}) = s_{c} \cdot u_{c}$ (4) where $\tilde{X} = [\tilde{x}_{1}, \tilde{x}_{2}, \ldots, \tilde{x}_{C}]$ and $F_{scale}(u_{c}, s_{c})$ refers to the channel-wise multiplication between the scalar $s_{c}$ and the feature map $u_{c} \in \mathbb{R}^{H \times W}$.
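The three steps in Eqs. (2) to (4) can be traced on a toy input. The following is a minimal sketch in plain Python rather than the TensorFlow implementation used in the experiments; the function names, the 2-channel input, and the weight values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def relu(v):
    return [max(x, 0.0) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def se_block(U, W1, W2):
    """Apply squeeze, excitation and scaling to U (C channels of H x W values)."""
    # Squeeze, Eq. (2): global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in U]
    # Excitation, Eq. (3): s = sigmoid(W2 * relu(W1 * z)).
    s = sigmoid(matvec(W2, relu(matvec(W1, z))))
    # Scale, Eq. (4): multiply every value in channel c by s_c.
    X = [[[s_c * u for u in row] for row in ch] for s_c, ch in zip(s, U)]
    return z, s, X


# Toy input: C = 2 channels, H = W = 2, reduction ratio r = 2 (so C / r = 1).
U = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 4.0]]]
W1 = [[0.5, -0.5]]        # shape (C / r) x C, hypothetical weights
W2 = [[1.0], [-1.0]]      # shape C x (C / r), hypothetical weights
z, s, X = se_block(U, W1, W2)
print(z, [round(v, 4) for v in s])
```

Because the Sigmoid output lies strictly between 0 and 1, each channel is attenuated rather than removed, and several channels can receive high weights at the same time.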

Finally, the feature map was obtained and passed into the next layer. The middle flow received the feature map from the entry flow and applied three convolution operations. After the middle flow processing was repeated eight times, the exit flow received the processed feature map. After adjusting the number of channels through two depthwise separable convolution operations, maximum pooling was performed to change the size. The fused result was passed into the global pooling layer through the residual connection. Figure 4 shows the entry flow in this paper. As can be seen from Fig. 4, the output of the depthwise separable convolutional layer was input into the SE module, and the output was then sent to the batch normalization layer, which can help the network to further focus on and select information conducive to facial expression classification.

Fig. 4

Entry flow in this paper.

4 Results

4.1 Data set

FER2013 consists of 35886 images of faces with different expressions, each 48 by 48 pixels in size. We collected images from FER2013 that fit our classification. The data set in this paper divides expressions into five categories: listening, thinking, understanding, mind wandering, and yawning. For these five categories of emotional states in a classroom, the facial expression features and corresponding classroom behaviors are given in Table 1. During model training, a larger data set helps to suppress overfitting, so we extended the data set with traditional data augmentation, including rotation and mirroring, as shown in Fig. 5. After data augmentation, listening had 688 images, thinking had 150 images, understanding had 594 images, yawning had 204 images, and mind wandering had 342 images. In total, 1978 images were obtained and divided into training and test sets at a ratio of 8 : 2.

Table 1
Classroom emotional state classification

Emotional States	Features of Facial Expression	Corresponding Classroom Behaviors
Listening	The expression is relaxed, with no obvious change.	Listening attentively to the teacher, or understanding and grasping the teaching content. A neutral expression is presented.
Thinking	The brows are furrowed and lowered.	Thinking about a knowledge point.
Understanding	The cheekbones are prominent, and the upper lip is raised.	Thorough understanding of a knowledge point.
Yawning	The mouth is wide open, the lips are protruding, and the eyes are closed.	Not being well rested or being unable to understand the knowledge points presented.
Mind wandering	The eyes are dull, and the corners of the mouth are pulled down.	Lacking interest in the knowledge.

Fig. 5

Data set facial expression examples.
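The class counts and the 8 : 2 split reported in this section can be turned into concrete set sizes. The following is a minimal sketch that assumes the split is stratified, that is, applied to each class separately; the paper does not state whether this was the case, and the function name is hypothetical.

```python
# Image counts per class after augmentation, as reported in Section 4.1.
class_counts = {
    "listening": 688,
    "thinking": 150,
    "understanding": 594,
    "yawning": 204,
    "mind wandering": 342,
}


def stratified_split_sizes(counts, train_fraction=0.8):
    """Return {class: (n_train, n_test)} using the same fraction for every class."""
    sizes = {}
    for name, n in counts.items():
        n_train = round(n * train_fraction)
        sizes[name] = (n_train, n - n_train)
    return sizes


total = sum(class_counts.values())
sizes = stratified_split_sizes(class_counts)
n_train = sum(train for train, _ in sizes.values())
n_test = sum(test for _, test in sizes.values())

for name, (train, test) in sizes.items():
    print(f"{name:15s} train={train:4d} test={test:4d} share={class_counts[name] / total:.1%}")
print(total, n_train, n_test)
```

Under this assumption the thinking class contributes only 30 test images, so its per-class accuracy is more sensitive to individual errors than that of the larger classes.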

4.2 Experimental results and analysis

The experimental environment in this paper was as follows: a 64-bit Windows 10 operating system, the Python 3.8 programming language, the TensorFlow 2.6.2 deep learning framework, and an Nvidia GeForce RTX 2080Ti graphics card. The batch size was set to 16 in the training stage, and Adam was used as the optimizer. Five rounds of pre-training were conducted with a learning rate of 10^–3, followed by a total of 150 rounds of training with an initial learning rate of 10^–4. In order to accelerate network convergence and to prevent overfitting, dropout regularization was added to the network. The network was pre-trained using the ImageNet data set.

To verify the effectiveness of the proposed algorithm, a comparison experiment was conducted with other algorithms on the data set. The comparison results are shown in Table 2.

Table 2
The identification results of different methods on the data set

Method	Acc(%)	Recall	F1_score
Resnet50	90.23	0.9023	0.9025
Inceptionv3	92.37	0.9241	0.9244
MobileNet v1	92.77	0.9275	0.9277
Xception	93.26	0.9334	0.9334
Ixception	95.48	0.9546	0.9546

By analyzing Table 2, we can see that compared with other methods, the method proposed in this paper achieved a good recognition effect. The accuracy of the proposed algorithm was improved by 5.25% compared to Resnet50, 3.11% compared to Inceptionv3, 2.71% compared to MobileNet v1, and 2.22% compared to Xception. Compared with the other algorithms, the recall rate and F1 score were also improved. It can be seen that the method proposed in this paper could learn facial features well and had higher recognition accuracy.

To verify the contribution of each component, ablation experiments were performed on the data set. The methods of the ablation experiments are shown in Table 3, which indicates whether the SE module was added to the Xception network and whether the activation function was improved. By adding the SE module to the model and improving the activation function, key information in the facial features could be effectively selected so as to improve the recognition accuracy of the network model. Table 2 shows that Acc, Recall and F1_score of the Ixception network model were 95.48%, 0.9546, and 0.9546, respectively, which were 2.22%, 0.0212, and 0.0212 higher than those of the Xception network model. Table 4 shows that after the SE module was added to the Xception network model, Acc, Recall and F1_score increased by 1.31%, 0.0084, and 0.0084, respectively, which was the largest gain in accuracy. After the activation function was improved, the three metrics increased by a further 0.91%, 0.0128, and 0.0128, respectively. Therefore, both the addition of the SE module and the improvement in the activation function contributed to the improvement in the network model’s recognition performance.

Table 3

Three methods of ablation experiments

Method	Model	Whether to add the SE module	Whether to improve activation function
1	Xception	×	×
2	Xception	√	×
3	Ixception	√	√

Table 4

Evaluation of different ablation methods on the data set

Method	Acc	Recall	F1_score
1	93.26%	0.9334	0.9334
2	94.57%	0.9418	0.9418
3	95.48%	0.9546	0.9546
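The contribution of each ablation step can be read directly from Table 4 by subtracting consecutive rows. The following is a minimal sketch of that arithmetic; the dictionary layout and function name are illustrative, and the values are copied from Table 4.

```python
# (Acc in %, Recall, F1_score) for each ablation method, copied from Table 4.
table4 = {
    1: (93.26, 0.9334, 0.9334),   # Xception
    2: (94.57, 0.9418, 0.9418),   # Xception with the SE module
    3: (95.48, 0.9546, 0.9546),   # Ixception: SE module and improved activation
}


def delta(after, before):
    """Element-wise difference between two rows, rounded to 4 decimals."""
    return tuple(round(a - b, 4) for a, b in zip(after, before))


se_gain = delta(table4[2], table4[1])            # effect of adding the SE module
activation_gain = delta(table4[3], table4[2])    # effect of changing the activation
total_gain = delta(table4[3], table4[1])         # Ixception versus Xception

print("SE module:  ", se_gain)
print("Activation: ", activation_gain)
print("Total:      ", total_gain)
```

Accuracy differences here are in percentage points, while the Recall and F1_score differences are on the 0 to 1 scale used in the table.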

4.3 Results on public datasets

To further verify the effectiveness of the proposed algorithm, the experimental results of Xception and Ixception on CK+, AffectNet, FER2013, and RaFD are compared. As shown in Table 5, the expression recognition accuracy obtained by the Ixception algorithm proposed in this paper is higher than that of the Xception algorithm on every data set.

Table 5
Results on public datasets

Data Set	Model	Acc
CK+	Xception	95.0%
	Ixception	96.7%
AffectNet	Xception	61.2%
	Ixception	62.3%
FER2013	Xception	73.2%
	Ixception	76.1%
RaFD	Xception	91.1%
	Ixception	93.9%

The confusion matrix reflects the recognition accuracy of each type of expression and how often it is misclassified as another type. The confusion matrices obtained by the Xception and Ixception networks on the test set are shown in Figs. 7 and 8, where the rows of the matrix represent the true labels and the columns represent the predicted labels.

Fig. 6

Evaluation index of different ablation methods on data set.

Fig. 7

The confusion matrix of Xception.

As can be seen from Figs. 7 and 8, the accuracy of detecting each type of facial expression was improved by the algorithm in this paper. As can be seen from the above figure, each type of facial expression was slightly confused with the others, but listening, understanding, and yawning could be easily distinguished with an accuracy of more than 95%.

Fig. 8

The confusion matrix of our method.
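The per-class figures discussed in this section follow from a confusion matrix whose rows are true labels and whose columns are predictions. The following is a minimal sketch of how accuracy, recall, precision, and F1 can be derived from such a matrix; the 3-class counts are hypothetical and are not the values shown in Figs. 7 and 8.

```python
def per_class_metrics(cm):
    """Return (accuracy, recalls, precisions, f1s) for a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    recalls, precisions, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        row_sum = sum(cm[i])                       # all samples truly in class i
        col_sum = sum(cm[r][i] for r in range(n))  # all samples predicted as class i
        recall = tp / row_sum if row_sum else 0.0
        precision = tp / col_sum if col_sum else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    return accuracy, recalls, precisions, f1s


# Hypothetical 3-class counts for illustration; these are not the paper's results.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
accuracy, recalls, precisions, f1s = per_class_metrics(cm)
print(round(accuracy, 4), recalls, precisions)
```

The diagonal entry divided by its row sum is the per-class recognition rate, so a class with few test samples can show a low rate after only a handful of misclassifications.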

The recognition rates for thinking and yawning were low, and some pictures were assigned to the wrong category. The distinguishing features of these expressions were concentrated around the eyes, while the rest of the facial features were more subtle, which led to predictions being misclassified as understanding and mind wandering. Yawning and mind wandering share some similar characteristics. Students will occasionally hold their faces in their hands when they are distracted, and they also easily cover their mouths with their hands when yawning, leading to confusion. Another reason for this phenomenon is that there were fewer training pictures of thinking and yawning, and the samples to learn from were limited, so their recognition rates were lower than those of the other three types.

5 Conclusions

In order to solve the problem that convolutional neural networks cannot focus on the regional features of students’ facial expressions in classroom images, this paper proposed an attention-mechanism-based facial expression recognition method built on the Xception network. The attention mechanism was introduced to adjust the proportion of channel information, to improve the ability of the network to extract important features from students’ facial expressions, and to learn the features of students’ facial expressions more reasonably. During training, dropout regularization was used to simplify the network structure and to prevent overfitting of the model. The experimental results on the data set presented in this paper showed that, compared with the benchmark network model and other classical networks, the proposed method effectively improved the performance of classroom students’ facial expression recognition. In future work, the problem of sample imbalance in the experimental data and how to optimize the algorithm more suitably for students’ facial expression recognition will be further studied.

Acknowledgments

This work is supported by Guangxi Key Laboratory of Trusted Software (KX202315); Industry-University-Research Innovation Foundation of Chinese University (2021LDA06003); Provincial Graduate Student Innovation Ability Training Funding Project of Hebei Provincial Education Department (CXZZSS2023058); and Hebei Normal University Teaching Reform Project (2023XJJG049).

References

1. Ekman, Friesen W.V., Facial action coding system: A technique for the measurement of facial movement, A Technique for the Measurement of Facial Action 1978.

2. Plutchik, A general psychoevolutionary theory of emotion, Theories of Emotion 1980:3–33.

3. Mello, Taylor R.S., Monitoring Affective Trajectories during Complex Learning, Proceedings of the Annual Meeting of the Cognitive Science Society 2007:203–208.

4. Chen, Luo, Liu et al., A hybrid intelligence-aided approach to affect-sensitive e-learning, Computing: Archives for Informatics and Numerical Computation 98 (2016), 215–233.

5. Wei Huang, Ning Li, Zhijun Qiu et al., An Automatic Recognition Method for Students’ Classroom Behaviors Based on Image Processing, Traitement du Signal 37(3) (2020), 503–509.

6. Lee H.-J. and Lee, Study of Process-Focused Assessment Using an Algorithm for Facial Expression Recognition Based on a Deep Neural Network Model, Electronics 10(1) (2021), 54.

7. Ravi, Real Time Facial Expression Recognition for Online Lecture, Wireless Communications and Mobile Computing 2022.

8. Talegaonkar Isha, Joshi Kalyani, Valunj Shreya et al., Real Time Facial Expression Recognition using Deep Learning, Social Science Research Network 2019.

9. Wang, Xu, Niu et al., Emotion Recognition of Students Based on Facial Expressions in Online Education Based on the Perspective of Computer Simulation, Complexity 2020.

10. Pabba, Kumar, An intelligent system for monitoring students’ engagement in large classroom teaching through facial expression recognition, Expert Systems 2022.

11. Jing Li, Kan Jin, Dalin Zhou et al., Attention mechanism-based CNN for facial expression recognition, Neurocomputing 411 (2020), 340–350.

12. Minaee, Liang, Yan, Modern Augmented Reality: Applications, Trends, and Future Directions 2022.

13. Kaiser, Gomez A.N., Chollet, Depthwise separable convolutions for neural machine translation, arXiv 2017, arXiv:1706.03059.

14. Dong, Zhang and Fan, A Multi-View Face Expression Recognition Method Based on DenseNet and GAN, Electronics 12 (2023). https://doi.org/10.3390/electronics12112527

15. Trabelsi, Alnajjar, Parambil M.M.A., Gochoo and Ali, Real-Time Attention Monitoring System for Classroom: A Deep Learning Approach for Student’s Behavior Recognition, Big Data Cogn Comput 7 (2023), 48. https://doi.org/10.3390/bdcc7010048

16. Trabelsi, Alnajjar, Parambil M.M.A., Gochoo and Ali, Real-Time Attention Monitoring System for Classroom: A Deep Learning Approach for Student’s Behavior Recognition, Big Data Cogn Comput 7 (2023), 48. https://doi.org/10.3390/bdcc7010048

17. Savchenko A.V., Savchenko L.V. and Makarov, Classifying Emotions and Engagement in Online Learning Based on a Single Facial Expression Recognition Neural Network, IEEE Transactions on Affective Computing 13(4) (2022), 2132–2143. doi: 10.1109/TAFFC.2022.3188390.

18. La Grutta, Epifanio M.S., Piombo M.A., Alfano, Maltese, Marcantonio, Ingoglia, Alesi, Lo Baido, Mancini and Andrei, Emotional Competence in Primary School Children: Examining the Effect of a Psycho-Educational Group Intervention: A Pilot Prospective Study, Int J Environ Res Public Health 19 (2022), 7628. https://doi.org/10.3390/ijerph19137628

19. Xception Network for Weather Image Recognition Based on Transfer Learning, 2022 International Conference on Machine Learning and Intelligent Systems Engineering (MLISE), Guangzhou, China, 2022, pp. 330–333. doi: 10.1109/MLISE57402.2022.00072

20. Akgül İ., Kaya and Zencir Tanır Ö., A novel hybrid system for automatic detection of fish quality from eye and gill color characteristics using transfer learning technique, PLoS ONE 18(4) (2023), e0284804. https://doi.org/10.1371/journal.pone.0284804

21.

22. Gao, Gong, Ding, Guo, Image Recognition Based on Mixed Attention Mechanism in Smart Home Appliances, 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 2021, pp. 1501–1505. doi: 10.1109/IAEAC50856.2021.9391092

23. Szegedy, Vanhoucke, Ioffe et al., Rethinking the inception architecture for computer vision, IEEE 2016:2818–2826.

24. Adnan M.M. et al., Automated Image Annotation With Novel Features Based on Deep ResNet50-SLT, IEEE Access 11 (2023), 40258–40277. doi: 10.1109/ACCESS.2023.3266296.

25. Zhao, Xie, Zou and He J.-B., Intelligent Recognition of Fatigue and Sleepiness Based on InceptionV3-LSTM via Multi-Feature Fusion, IEEE Access 8 (2020), 144205–144217. doi: 10.1109/ACCESS.2020.3014508.

26. Kadam, Ahirrao, Kotecha and Sahu, Detection and Localization of Multiple Image Splicing Using MobileNet V1, IEEE Access 9 (2021), 162499–162519. doi: 10.1109/ACCESS.2021.3130342.

27. Zhang, Ren et al., Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision & Pattern Recognition, IEEE Computer Society 2016.

28. Cheng, Zhou, Wang and Wen, Emotion-Recognition Algorithm Based on Weight-Adaptive Thought of Audio and Video, Electronics 12 (2023), 2548. https://doi.org/10.3390/electronics12112548

29. Wang, Yuan, Xu and Wen, CSDS: End-to-End Aerial Scenes Classification with Depthwise Separable Convolution and an Attention Mechanism, IEEE J Sel Top Appl Earth Observ Remote Sens 14 (2021), 10484–10499.

30. Cai, Zhang and Jiang, Power Quality Disturbance Classification Based on Parallel Fusion of CNN and GRU, Energies 16 (2023), 4029. https://doi.org/10.3390/en16104029

31. Zhu et al., MS-HNN: Multi-Scale Hierarchical Neural Network With Squeeze and Excitation Block for Neonatal Sleep Staging Using a Single-Channel EEG, IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023), 2195–2204. doi: 10.1109/TNSRE.2023.3266876.

32. Akbari Asanjan, Memarzadeh, Lott P.A., Rieffel, Grabbe, Probabilistic Wildfire Segmentation Using Supervised Deep Generative Model from Satellite Imagery, Remote Sens 15 (2023), 2718. https://doi.org/10.3390/rs15112718
