Abstract
In remote intelligent teaching, the facial expression features can be recorded in time through facial recognition, which is convenient for teachers to judge the learning status of students in time and helps teachers to change teaching strategies in a timely manner. Based on this, this study applies machine learning and virtual reality technology to distance classroom teaching. Moreover, this study uses different channels to automatically learn global and local features related to facial expression recognition tasks. In addition, this study integrates the soft attention mechanism into the proposed model so that the model automatically learns the feature maps that are more important for facial expression recognition and the salient regions within the feature maps. At the same time, this study performs weighted fusion on the features extracted from different branches, and uses the fused features to re-recognize student features. Finally, this study analyzes the results of this paper through control experiments. The research results show that the algorithm proposed in this paper has good performance and can be applied to the distance teaching system.
Introduction
The main purpose of behavior recognition is to analyze the types of behaviors performed by people in a video, which involves knowledge and techniques from many fields, such as computer vision, image processing, pattern recognition, and feature engineering. In the recognition process, the algorithm not only needs to locate and track the behavior in the image sequence, but also needs to analyze the state and trend of the movement and infer the specific behavior category of the person from it. Compared with images, video carries a larger amount of information. Before the era of deep learning, behavior recognition was a difficult subject: in traditional machine learning, the image or video feature extraction process is manually designed and video has a high information content, so progress in behavior recognition research was slow. With the advent of deep learning and the substantial improvement of computing power, behavior recognition has attracted the attention of a large number of researchers and has achieved good results. Some objective factors of human behavior also hinder the development of behavior recognition: differences in individual execution (even for the same behavior, different people perform it in different ways); changes of shooting angle (different shooting angles yield different observation data, so the same motion can look very different); and camera shake, lighting changes, and complex backgrounds. With the progress of society and the rapid development of the economy, the state and society are increasingly concerned about public safety, so a large number of monitoring systems are deployed on highways and in schools, shopping malls, and other places [1]. This monitoring equipment inevitably generates a large amount of video data, which occupies a lot of storage resources.
At the same time, these monitoring devices only provide a storage function and cannot perform intelligent analysis, so video analysis requires a lot of manpower and material resources when it is needed. Moreover, if an abnormal event occurs in the monitored scene, it cannot be reported in real time, so effective measures cannot be taken in time. Therefore, building an effective behavior recognition system plays a vital role in the field of public safety. In addition, behavior recognition has high practical value in the field of virtual reality [2]. Virtual reality uses a computer to generate a virtual three-dimensional world and, by collecting human sensory information such as movement, body language, and sound in real time, allows users to feel present in the scene and observe the created three-dimensional world without restrictions. Virtual reality technology has also been applied in medicine, the military, and education, helping these fields achieve good results. In addition, behavior recognition is widely used in human-computer interaction, video retrieval, and other fields [3].
Related work
Behavior recognition is one of the hot research topics in the field of computer vision, and its research topic “behavior” has multiple definitions. The literature [4] believed that Action is usually a simple action mode performed by one person and lasts for a short time (around tens of seconds), while Activity is composed of a complex sequence of actions performed by several people and several people interact with each other in a specific way. For example, running or swimming belongs to human actions, and two people shaking hands or two teams playing football matches belongs to human activities. The literature [5] defined the term “action primitive” as atomic motion that can be described at the limb level. Therefore, behavior consists of a series of behavior primitives, and a series of behaviors constitute more complex activities. For example, running behavior consists of a series of behavior primitives such as raising arms and legs, and running, kicking and other behaviors can constitute football activities. The literature [6] believed that the key to behavior is that it will bring changes or transforms to the surrounding environment. The definition in the literature [7] is more reasonable, but it needs to be clear that human behavior has a relatively clear start and end time, and can be classified as an accurate category. Multiple continuous behaviors constitute human activities, and the number of people is not the main basis for distinguishing between behaviors and activities. The main challenge of behavior recognition is that even the same behavior category has a large difference in appearance and posture. In the field of machine learning, feature extraction is an important link, so it becomes particularly important to convert behavior-containing videos into feature vectors. Behavior recognition feature extraction includes two types: global features and local features. 
Global features are extracted from the entire process of human behavior and provide rich and discriminative motion information. In this process, background modeling is carried out first, and the foreground objects of interest in each frame are extracted for feature coding to obtain the global characteristics of the motion. The literature [8] proposed two representations, the motion energy image (MEI) and the motion history image (MHI), whose purpose is to encode the complete dynamic human behavior into a single image. The motion energy image is a binary image template that describes where motion occurs. The motion history image shows when the motion occurred: each pixel is related to the motion history at that location, and a larger value indicates that the motion occurred more recently. In addition to extracting motion information from image frames, optical flow has been proposed as complementary information. Optical flow is extracted from consecutive image frames and describes the movement in the horizontal and vertical directions under the assumption of unchanged lighting conditions [9]. Local features focus only on the regions where significant motion occurs, describe the local motions in those regions, and use the features of those regions as feature vectors. Classic methods include the recognition method based on spatiotemporal interest points [10] and the method based on motion trajectories proposed in the literature [11].
In recent years, driven by the improvement of computing power and the availability of big data, deep learning methods have been widely used in computer vision, natural language processing, and other fields, and researchers have gradually adopted deep learning to solve behavior recognition tasks. Since the Convolutional Neural Network (CNN) made important breakthroughs on the ImageNet dataset, CNN-based behavior recognition methods have been proposed [12]. The literature [13] proposed the concept of a three-dimensional convolution kernel, which is used to model temporal and spatial information in video simultaneously. The literature [14] tried a variety of fusion methods based on the CNN model to fuse image features into video behavior features. The literature [15] proposed a dual-stream network model that learns the spatial and temporal information in the video separately and then fuses the classification probabilities of the two networks to obtain the final classification result. In addition, some top journals in the field of artificial intelligence and pattern recognition, such as TPAMI, IJCV, and PRL, have opened behavior recognition columns. Moreover, international conferences such as CVPR, ICCV, and ECCV give researchers the opportunity to present and discuss cutting-edge algorithms. In order to solve practical problems in industry, many behavior recognition competitions have emerged, such as the TRECVID event retrieval competition [16], the THUMOS behavior recognition competition [17], and the YouTube-8M video classification competition [18].
The expression feature extraction algorithms based on static images can be divided into two categories: global feature extraction methods and local feature extraction methods. In terms of global feature extraction, the commonly used feature extraction algorithms include Principal Component Analysis (PCA), Active Appearance Model (AAM) and Independent Component Analysis (ICA). The literature [19] applied the PCA algorithm to the facial expression recognition task and verified the feasibility of PCA in the field of facial expression recognition through experiments. The literature [20] used principal component analysis algorithm to reduce the dimensionality of the original image feature matrix and used Linear Discriminant Analysis (LDA) projection matrix to classify the dimensionality-reduced features to recognize facial expressions.
Mixed attention subnetwork
At present, researchers have tried different methods to reduce the impact of cluttered background information on pedestrian re-identification [21]. Moreover, using image semantic segmentation to obtain the pedestrian's body region is also a way to reduce the negative impact of a cluttered background. For pedestrian re-identification, binary mask information has two advantages:
The mask can remove the cluttered background at the pixel level. In this way, the pedestrian re-identification model is more robust against different backgrounds [22].
The mask contains the shape information of the human body, which is a very important feature. According to previous studies, the human body mask is very robust to different lighting and clothing colors, which helps to identify pedestrians. With the rapid development of deep learning, image semantic segmentation techniques such as the fully convolutional network (FCN) and Mask R-CNN have become relatively mature. Moreover, large-scale human semantic segmentation data sets are already available. Therefore, these semantic segmentation techniques can produce an ideal body mask, and the background can be accurately removed from the pedestrian image, as shown in Fig. 1 [23–25].

Fig. 1. Pedestrian image and corresponding mask image.
(2) Contrast spatial attention
Usually, the spatial attention model takes the intermediate feature map of the network as input and outputs the weight map. According to this weight map, the spatial attention operation is performed on the feature map. Through such operations, the network can notice the regions on the feature map that contribute the most to the model training effect.
Given the input sample RGB-M, the feature map fstage-2 is obtained after two stages of the hybrid contrast attention network. fstage-2 is used as the input of the spatial attention sub-network to obtain the human attention map. This process can be expressed as:
Here, σ(x) = 1/(1 + exp(-x)) is the sigmoid function, and W and b are the weight and bias of the convolution kernel, respectively. Then, a contrasting attention map Φ- is generated, which attends to the opposite features. To ensure that Φ+ and Φ- form a pair of contrasting attention maps, they must satisfy the following constraint at each position (i, j):
These two attention maps are used separately on feature map fstage-2 to produce a pair of contrasting feature maps:
Among them, ⊗ means element multiplication. For
Among them, M is a body mask generated in advance from a pedestrian image by an appropriate image semantic segmentation method, which is adjusted to the same size as the attention map. In this way, the spatial attention sub-network can generate a pair of contrast feature maps related to the human body and the background, respectively.
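As a rough illustration of the contrast spatial attention described above, the following NumPy sketch applies a sigmoid to a convolution over fstage-2 and derives the contrasting map. It is a sketch under stated assumptions, not the network's actual implementation: the 1 × 1 kernel, the constraint that Φ+ and Φ- sum to 1 at every position, and the names `contrast_spatial_attention`, `w`, and `b` are all assumptions made for illustration, and the supervision by the mask M is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrast_spatial_attention(f, w, b):
    """f: (C, H, W) feature map; w: (C,) weights of an assumed 1x1 conv; b: scalar bias."""
    # A 1x1 convolution is a weighted sum over channels at each position.
    logits = np.tensordot(w, f, axes=([0], [0])) + b   # shape (H, W)
    phi_pos = sigmoid(logits)                          # attention on the body
    phi_neg = 1.0 - phi_pos                            # ASSUMED constraint: phi_pos + phi_neg = 1
    # Broadcast the single-channel maps over all channels (element-wise product).
    f_pos = phi_pos[None, :, :] * f
    f_neg = phi_neg[None, :, :] * f
    return phi_pos, phi_neg, f_pos, f_neg

rng = np.random.default_rng(0)
f_stage2 = rng.standard_normal((96, 16, 40))   # size taken from the text
w = 0.1 * rng.standard_normal(96)              # hypothetical kernel weights
phi_pos, phi_neg, f_pos, f_neg = contrast_spatial_attention(f_stage2, w, 0.0)
```

Under the assumed constraint, the two contrasting feature maps add up to the original feature map, so no information is discarded; it is only split between the two branches.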
(3) Contrast channel attention
Using contrast spatial attention alone causes a problem. Since the mask information has a single channel while the hidden-layer features of a convolutional neural network are usually multi-channel, the learned spatial attention map must be copied across channels. However, the intermediate feature map has already passed through several network layers, which have learned the relationship between spatial features and the weights of the different channels. When such a single-channel attention map, which ignores the channel relationship, is applied to the intermediate feature map, the feature map obtains better spatial weights, but the learned channel weights are lost to a certain extent.
This section uses a simple example to illustrate the effect of single-channel spatial attention on channel weights. The average of a channel's feature map, as computed by global average pooling, is usually used to represent the weight of the channel. Assume a feature map with 2 channels of size 2 × 2. The feature maps of the two channels are 1, 0, 2, 0 and 1, 0, 0, 0, so the weight ratio of the two channels is 3 : 1. Now assume that the single-channel spatial attention is 1, 0, 0, 0. After the spatial attention is multiplied element-wise with the feature maps of the two channels, they become 1, 0, 0, 0 and 1, 0, 0, 0, and the channel weight ratio is 1 : 1. In this case, the single-channel spatial attention damages the channel weights that the network has learned. Therefore, this paper uses both channel attention and spatial attention in the design of the network. Channel attention not only offsets the influence of spatial attention on the channel weights of the feature maps, but also enables the feature maps to learn better channel weights.
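The arithmetic of this example can be checked with a few lines of NumPy. The 2 × 2 row-major layout of the four listed values is an assumption made for illustration; the channel weights are the per-channel means, as in global average pooling.

```python
import numpy as np

# Two channels of a 2x2 feature map (row-major layout is an assumption).
features = np.array([[[1, 0], [2, 0]],
                     [[1, 0], [0, 0]]], dtype=float)
# Single-channel spatial attention, shared by both channels.
attention = np.array([[1, 0], [0, 0]], dtype=float)

# Channel weight = mean of the channel's feature map (global average pooling).
weights_before = features.mean(axis=(1, 2))
attended = features * attention            # broadcast over the channel axis
weights_after = attended.mean(axis=(1, 2))

ratio_before = weights_before[0] / weights_before[1]
ratio_after = weights_after[0] / weights_after[1]
print(ratio_before, ratio_after)
```

The ratio drops from 3 to 1, which is the loss of learned channel weights that motivates adding channel attention.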
Since spatial attention is designed as a pair of contrasting attention maps, channel attention is designed as a contrasting pair as well. The entire contrast channel attention process of the mixed attention sub-network is as follows. After two MSCANs, an intermediate feature map fstage-2 of size 96 × 16 × 40 is obtained. Global average pooling is then performed to obtain a 96-dimensional feature vector, namely:
Here, W and H denote the width and height of the feature map in pixels, and pix(i, j) denotes the feature value at position (i, j).
After that, z_c successively passes through two fully connected layers with 96 neurons each to obtain the 96-dimensional feature vector f_{z_c}. Finally, f_{z_c} passes through the sigmoid activation function to give the positive channel attention Ψ+:
Among them, σ (x) = 1/(1 + exp(- x)) is the sigmoid function.
The attention of the pair of contrast channels is adjusted to 96 × 16 × 40, and the adjusted channel contrast attention is used on the intermediate feature map fstage-2:
Among them, ⊗ means element multiplication. Since ⊗ is a multiplication of elements, both
(4) Mixed contrast attention
After the contrast spatial attention feature map f_att and the contrast channel attention feature map fc-att are obtained, the next step is to fuse the two. Since both spatial attention and channel attention are produced by the sigmoid function and both are applied to the feature map fstage-2, the two are comparable in magnitude. Therefore, the two attention feature maps can be fused directly by element-wise addition:
Among them, ⊕ means adding elements.
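The channel attention and the fusion step described above can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the paper's implementation: the ReLU between the two fully connected layers, the negative attention being 1 minus the positive one, the random weights, and all function and variable names are assumptions made for illustration. The spatial attention map used in the demo is a random stand-in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrast_channel_attention(f, w1, b1, w2, b2):
    """f: (C, H, W). Returns the positive and negative channel-attended feature maps."""
    z = f.mean(axis=(1, 2))                    # global average pooling -> (C,)
    hidden = np.maximum(w1 @ z + b1, 0.0)      # first FC layer (ReLU is ASSUMED)
    fz = w2 @ hidden + b2                      # second FC layer
    psi_pos = sigmoid(fz)                      # positive channel attention
    psi_neg = 1.0 - psi_pos                    # ASSUMED contrast: the two sum to 1
    # Resize (C,) -> (C, 1, 1) so the weights broadcast to C x H x W.
    return psi_pos[:, None, None] * f, psi_neg[:, None, None] * f

def mix(f_spatial, f_channel):
    """Fuse spatial and channel attention feature maps by element-wise addition."""
    return f_spatial + f_channel

rng = np.random.default_rng(0)
C, H, W = 96, 16, 40                           # size taken from the text
f_stage2 = rng.standard_normal((C, H, W))
w1, w2 = 0.1 * rng.standard_normal((2, C, C))  # hypothetical FC weights
b1 = b2 = np.zeros(C)
fc_pos, fc_neg = contrast_channel_attention(f_stage2, w1, b1, w2, b2)

phi_pos = sigmoid(rng.standard_normal((H, W))) # stand-in spatial attention map
fs_pos = phi_pos[None] * f_stage2
f_mixed_pos = mix(fs_pos, fc_pos)              # input to the human branch
```

Because both attention weights lie in (0, 1) and act on the same feature map, plain addition keeps the two terms on the same scale without an extra normalization step.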
In the model, the overall branch remains unchanged and the intermediate feature fstage-2 is still used. However, the human branch and the background branch use mixed contrast attention feature maps
This section introduces the objective function of the model and the entire algorithm flow of training. After the attention operation, the feature maps
Among them, m is the boundary parameter. By minimizing this loss function, in the feature space, the features from the global branch and the human branch will be drawn closer, but the features of the global branch and the background branch will be pushed away. Finally, the characteristics of the overall branch will be more concentrated on the human body area than the cluttered background, which can improve the effect of pedestrian re-identification.
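A minimal sketch of a region-level loss with this behavior is given below. The exact formula is an assumption: a standard hinge-style triplet loss with squared Euclidean distances is used, with the overall-branch feature as the anchor, the human-branch feature as the positive, and the background-branch feature as the negative. The function and argument names are hypothetical.

```python
import numpy as np

def region_triplet_loss(f_full, f_body, f_bg, m):
    """Hinge triplet loss (ASSUMED form): pull f_full toward f_body, push it from f_bg."""
    d_pos = np.sum((f_full - f_body) ** 2)   # overall vs. human branch
    d_neg = np.sum((f_full - f_bg) ** 2)     # overall vs. background branch
    return max(0.0, d_pos - d_neg + m)       # m is the boundary (margin) parameter

f_full = np.array([1.0, 0.0])
f_body = np.array([1.0, 0.0])
f_bg = np.array([0.0, 1.0])
loss_good = region_triplet_loss(f_full, f_body, f_bg, m=1.0)  # already separated
loss_bad = region_triplet_loss(f_full, f_bg, f_body, m=1.0)   # roles swapped
```

The loss is zero once the background feature is farther from the overall feature than the human feature by at least the margin, so only violating samples contribute gradients.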
While the region-level triplet loss function is applied to the three branches, a softmax operation is performed on the last layer of each branch to predict the identity of the pedestrian. The corresponding cross-entropy loss is denoted as L_id. The overall loss function of FCANet is:
Among them, α and β are hyperparameters.
After training FCANet, it is used as the basic model of the twin network to continue training to reduce the distance between features generated by the same pedestrian and increase the distance between features generated by different pedestrians. Among them, the weights of the two networks are shared. After the RGB-M inputs of pedestrians p and g are given and the pedestrian feature vectors generated by the overall branch are recorded as h (p) and h (g), respectively, the loss function of the twin network is defined as:
Among them, m is the boundary parameter. According to previous work, in order to improve the effect of pedestrian re-identification, the contrast loss function and the identity loss function are used simultaneously. Subsequently, FCANet’s regional-level loss function is added. At this time, the overall loss function becomes:
Among them, λ is a hyperparameter.
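For reference, one common form of a twin-network (contrastive) loss with a boundary parameter is sketched below. Whether the model uses exactly this form is an assumption; the squared-distance and squared-hinge terms, the function name, and the `same_identity` flag are chosen for illustration only.

```python
import numpy as np

def contrastive_loss(h_p, h_g, same_identity, m):
    """Standard contrastive loss (ASSUMED form) on two feature vectors h(p), h(g)."""
    d = np.linalg.norm(h_p - h_g)        # Euclidean distance between the features
    if same_identity:
        return d ** 2                    # pull features of the same pedestrian together
    return max(0.0, m - d) ** 2          # push different pedestrians at least m apart

h_p = np.array([0.0, 0.0])
h_g = np.array([3.0, 4.0])               # distance 5 from h_p
loss_same = contrastive_loss(h_p, h_g, True, m=100.0)
loss_diff = contrastive_loss(h_p, h_g, False, m=100.0)
loss_far = contrastive_loss(h_p, h_g, False, m=4.0)   # already beyond the margin
```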
(1) Parameter design
The dilation rates of the three parallel dilated convolutions in each MSCAN are set to 1, 2, and 3, respectively, and the number of convolution kernels is 32. Of the last two fully connected layers, the first has 128 neurons, and the second has as many neurons as there are pedestrians in the training set. Moreover, the L2 regularization parameter is set to 0.005.
The three loss-function parameters λ, α, and β are set to 0.01, 0.01, and 0.1, respectively. The boundary value m is set to 100. The initial learning rate of the mixed contrast attention network is set to 0.01. Training runs for about 75,000 iterations, and the learning rate is reduced every 15,000 iterations. For the twin network, the initial learning rate is set to 0.0001, and the learning rate is likewise reduced every 15,000 iterations.
(2) Training process
The input data is preprocessed. For a three-channel image, the value pix(i, j) of each pixel (i, j) in each channel has the mean subtracted from it and is then divided by 255:
For the mask image, the grayscale image is read, and then the same normalization operation as the three-channel image is performed. After that, the pre-processed three-channel image and mask image are spliced by channel as the input of the network.
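The preprocessing described above can be sketched as follows. Several details are assumptions made for illustration: the mean is computed per channel from the image itself (a data-set mean may be intended), the mean is subtracted from the pixel value rather than the reverse, and the 160 × 64 input size and all names are hypothetical.

```python
import numpy as np

def preprocess(rgb, mask):
    """rgb: (H, W, 3) uint8 image; mask: (H, W) uint8 grayscale mask. Returns (H, W, 4)."""
    rgb = rgb.astype(np.float32)
    mean = rgb.mean(axis=(0, 1), keepdims=True)   # per-channel mean (ASSUMED per image)
    rgb_norm = (rgb - mean) / 255.0
    mask = mask.astype(np.float32)
    mask_norm = (mask - mask.mean()) / 255.0      # same normalization as the RGB image
    # Concatenate along the channel axis to form the RGB-M input.
    return np.concatenate([rgb_norm, mask_norm[..., None]], axis=-1)

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(160, 64, 3), dtype=np.uint8)   # hypothetical image
mask = (rng.random((160, 64)) > 0.5).astype(np.uint8) * 255     # hypothetical binary mask
x = preprocess(rgb, mask)
```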
First, FCANet is trained, and training is stopped after 75,000 iterations. After that, the trained FCANet is used as the base model of the twin network, which is trained until it converges. The feature vector f_full generated by the overall branch is used for pedestrian re-identification. The entire process is shown in Fig. 2.
(3) Feature distance measurement method

Twin network.
After model training is completed, the pedestrian images in the probe set and the gallery set are fed to the model to obtain feature vectors. The distance or similarity between each probe image and each gallery image is calculated from the feature vectors, the gallery images are sorted by similarity, and the sorted gallery images are matched against the probe image. In this paper, the feature distance measurement method is Re-rank.
First, the distance between image feature vectors is calculated by using Euclidean distance, and the images in the gallery are sorted for different probe images. Re-rank is to reorder the sorted gallery image sequence

Training process diagram.
The top k samples with the highest similarity to the probe image p in the original sorted sequence form the set N(p, k), so that |N(p, k)| = k. According to N(p, k), the interrelation set of the probe image p and the gallery images is defined as:
According to the above definition, it can be seen that the connection between the gallery image and the probe image in R (p, k) is larger than that in N (p, k).
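The two sets can be illustrated with a small sketch. It assumes that R(p, k) contains the samples that are among the k nearest neighbors of p and that also have p among their own k nearest neighbors, which is the usual reading of a mutual-relation set; the function names and the one-dimensional toy features are hypothetical.

```python
import numpy as np

def k_nearest(dist, i, k):
    """N(i, k): indices of the k samples closest to sample i, excluding i itself."""
    order = np.argsort(dist[i])
    order = order[order != i]
    return set(order[:k].tolist())

def k_reciprocal(dist, i, k):
    """R(i, k): neighbors j of i that also have i among their own k nearest (ASSUMED form)."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}

# Toy 1-D features: index 0 plays the probe, the rest play gallery images.
x = np.array([0.0, 1.0, 3.0, 3.5, 10.0])
dist = np.abs(x[:, None] - x[None, :])   # pairwise distance matrix

n_set = k_nearest(dist, 0, k=2)
r_set = k_reciprocal(dist, 0, k=2)
print(n_set, r_set)
```

In this toy case sample 2 is a neighbor of the probe but prefers samples 1 and 3 as its own neighbors, so it is dropped from R(p, k); the mutual check keeps only the more reliable matches.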
However, there is a problem with this definition. Due to changes in lighting, pedestrian poses, and viewing angles, the positive samples in the gallery dataset may belong to N (p, k) but not R (p, k). In order to solve this problem, the robustness of R (p, k) is increased and extended to R * (p, k):
After expansion, there are more positive samples in R * (p, k), and these newly added samples are more similar to the probe image in comparison with the original samples in R (p, k).
The degree of overlap between the mutual relationship sets of two images reflects their similarity to a certain extent: the larger the overlap, the more similar the two images. Therefore, a new measure can be defined based on the relationship between the two images, the Jaccard distance:
According to the interaction set of the image, the feature vector of the image is re-encoded:
Among them,
According to the encoding of V_p, each sample in R*(p, k) has the same weight. However, defining the weights according to similarity to the image should give better ranking results; that is, samples with higher similarity should receive greater weights. Therefore, formula (22) is improved with a Gaussian kernel of the pairwise sample distance:
Here, d(p, g_i) is the distance between images p and g_i under the original measurement method, for which this paper uses the Euclidean distance. With this improvement, the hard 0/1 weights become soft weights, in which samples closer to the probe image receive higher weights.
After the encoding of V_p is improved, the calculation of the Jaccard distance changes accordingly. In the Jaccard distance:
Here, the minimization and maximization operations are performed element-wise on the vectors, and ∥·∥1 denotes the L1 norm. The final calculation of the Jaccard distance is:
Since the images of the same pedestrian will have similar characteristics, the final coding features of the extended image are:
Considering that the original distance measurement method also has a certain effect, the Jaccard distance and the original distance are mixed as the final measurement distance:
Here, λ ∈ [0, 1].
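The soft-weight encoding, the min/max form of the Jaccard distance, and the final mixed distance described above can be sketched together. The exact formulas are assumptions: the weight is taken as exp(-d) for members of the set and 0 otherwise, and the final distance weights the original distance by λ and the Jaccard distance by 1 - λ. All function names are hypothetical.

```python
import numpy as np

def encode(dist_row, neighbors):
    """Soft encoding V_p: exp(-d(p, g_i)) if g_i is in the set, else 0 (ASSUMED form)."""
    v = np.zeros_like(dist_row, dtype=float)
    idx = sorted(neighbors)
    v[idx] = np.exp(-dist_row[idx])
    return v

def jaccard_distance(v_p, v_g):
    """1 - |intersection| / |union|, using element-wise min and max and the L1 norm."""
    inter = np.minimum(v_p, v_g).sum()
    union = np.maximum(v_p, v_g).sum()
    return 1.0 - inter / union            # assumes at least one non-zero weight

def final_distance(d_jaccard, d_original, lam):
    """Mix of the two distances; which term lambda weights is an ASSUMPTION."""
    return (1.0 - lam) * d_jaccard + lam * d_original

v = encode(np.array([0.0, 1.0, 2.0]), {0, 1})
d_j = jaccard_distance(np.array([1.0, 0.5, 0.0]), np.array([1.0, 0.0, 0.5]))
d_final = final_distance(d_j, 2.0, lam=0.3)
```

With hard 0/1 weights the same two functions reduce to the set-based Jaccard distance, which is why the min/max formulation is a drop-in generalization.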
The performance of the proposed model, which is named ML-VR, is analyzed in this section.
First, from the characteristics of the data sets, it can be seen that the expression intensity of each image sequence in the CK+ dataset increases gradually from the neutral expression to the peak expression, and the same holds for the image sequences in the Oulu-CASIA dataset. Therefore, the three frames with the strongest expression intensity are selected, facial landmarks are detected with the ensemble of regression trees method, and finally the face is aligned and normalized to 48 × 48 pixels. Then, ML-VR is used to extract facial expression features from the pictures. When the model proposed in this paper is used for learning, deep features with better discriminability and robustness can be obtained. The balance parameters λ and λ1 are set to 0.008, α1 and α2 are set to 0.1 and 1.0, respectively, and ξ is 2.0. During testing, only the ML-VR model is used to extract 512-dimensional deep features. Finally, the extracted features are input to the SoftMax classifier for classification. Because the CK+ dataset and the Oulu-CASIA dataset contain different numbers of facial expressions, the SoftMax classifier assigns the features extracted from the CK+ dataset to one of seven categories: anger, disgust, happiness, sadness, surprise, fear, and contempt. The features extracted from the Oulu-CASIA dataset are assigned to one of six categories: anger, disgust, happiness, sadness, surprise, and fear. In order to verify the effectiveness of the soft attention mechanism and the regularized center loss respectively, four sets of comparative experiments are carried out.
(1) MSCNN + Ls: the experiment uses only the SoftMax classification loss as the objective function to optimize the multi-scale convolutional neural network without the soft attention mechanism.
(2) ML-VR + Ls: the experiment uses only the SoftMax classification loss as the objective function to optimize the multi-scale attention convolutional neural network with the soft attention mechanism.
(3) ML-VR + Ls + Lc: the experiment uses the center loss and the SoftMax classification loss as the objective function to optimize the multi-scale convolutional neural network with the soft attention mechanism.
(4) ML-VR + Ls + LRC: the experiment uses the regularized center loss and the SoftMax classification loss as the objective function to optimize the multi-scale attention convolutional neural network with the soft attention mechanism.
First, the confusion matrix of the model is analyzed, and the results are shown in Tables 1 and 2 and Figs. 4 and 5:
Table 1. ML-VR recognition rate confusion matrix in CK+
Table 2. ML-VR recognition rate confusion matrix in Oulu-CASIA

Fig. 4. The statistical diagram of ML-VR recognition rate in CK+.

Fig. 5. The statistical diagram of ML-VR recognition rate in Oulu-CASIA.
From the confusion matrices in the tables above, it can be seen that the three expressions of contempt, disgust, and fear may be confused with one another. The reason is that when the subjects in the database present these three expressions, the mouth area or the eye area may change similarly. In addition, people open their mouths unconsciously when they are afraid and when they are happy, so these two expressions are also confused. Finally, the three expressions of sadness, anger, and disgust are also misclassified, because the mouth, eye, and eyebrow areas of the three expressions are similar and change in similar ways. Compared with the other comparative experiments, the multi-scale attention convolutional neural network model incorporating the soft attention mechanism achieves relatively good recognition results on the test set of the CK+ database. In particular, its recognition accuracy for anger, contempt, fear, happiness, and sadness is high. On the Oulu-CASIA database, the facial expression recognition algorithm proposed in this paper recognizes disgust, fear, sadness, happiness, and surprise relatively well.
Finally, this research method is applied to the distance teaching classroom, and the research method is compared with the CNN model. In the model recognition, a total of 60 sets of experimental recognition are performed, and the results of the recognition accuracy are shown in Table 3 and Fig. 6.
Table 3. Statistical table of classroom performance identification accuracy (%)

Fig. 6. Statistical diagram of model classroom performance identification accuracy.
It can be seen from Fig. 6 that the accuracy of this research model in class performance identification is close to 90%, and compared with the CNN model, this research model has a greater advantage.
Traditional re-recognition algorithms for student performance in the classroom require manual design and extraction of image features. The effect of student re-recognition therefore depends heavily on feature design and extraction, and such algorithms often fail to achieve satisfactory results. As deep learning has achieved great success in image classification, it has been introduced into classroom re-recognition, and the effect of student feature re-recognition has improved significantly. Since spatial attention is designed as a pair of contrasting attention maps, this paper designs channel attention as a contrasting pair as well, so the mixed attention sub-network is contrastive throughout. The mixed contrast attention in this paper divides the image into two parts at the pixel level according to the importance of the features, and the human body branch and the background branch extract features from these two parts, respectively. Therefore, in order to make full use of image features, this paper performs weighted fusion of the features extracted from the different branches and uses the fused features to re-recognize student features. In addition, this study analyzes the results through a controlled experiment. The results show that the proposed algorithm performs well and in line with expectations.
