Abstract
Generally, emotions in short videos are conveyed through the characters' behaviour and the text content in the video. The mainstream recognition model is based on a convolutional neural network. However, its ability to process video data is limited, its recognition accuracy is not high enough, and it struggles with complex text. To solve these problems, a new text recognition emotion classification model is designed by improving the text feature fusion module of the convolutional neural network model. Moreover, a new human behaviour recognition emotion classification model is designed by introducing a multi-head attention mechanism into a 3D convolutional neural network model. The results confirmed that the accuracy of the improved text recognition model was around 75%, while the average accuracy of the original convolutional model was only about 67%. The 3D convolutional model with the multi-head attention mechanism had the highest recognition accuracy, reaching 94.5% on the UCF101 database, which was 12 to 39 percentage points higher than the models with other attention mechanisms. The improved convolutional network models for short video text classification and behaviour recognition therefore outperform traditional models, and the results can serve as a technical reference for classification models.
Keywords
Introduction
Computer vision technology has developed rapidly, and short videos have gradually become an indispensable part of people's lives. This makes short video sentiment recognition and classification based on short video data analysis a current research hotspot. Usually, emotions in short videos are conveyed through the behaviour of the characters and the text content in the video. 1 As a method based on convolution operations, the convolutional neural network (CNN) is widely applied in image recognition, object detection, face recognition, natural language processing, and other fields. 2 In image recognition, CNN can classify, segment, and detect image information. In text processing, CNN can recognize, associate, and classify text content. 3 However, traditional CNN also has problems. In behavioural feature recognition, using an average sampling strategy to convert video data into image frames leads to information loss, resulting in insufficient recognition accuracy. Additionally, it has difficulty processing lengthy and complex text because of grammar constraints. 4 In this context, this study improves the feature fusion module of traditional CNN to obtain a new text recognition and classification system, and combines multi-head self-attention (MSA) with 3D CNN to obtain a new behaviour recognition classification model.
This study consists of four parts. The first part reviews the current state of research on convolutional models in text and behaviour recognition. The second part introduces the methods used in this study. The third part reports experiments based on these methods and analyzes the results. The final part summarizes the research results.
Related works
As computer technology develops and people's demand for information increases, the analysis of human behaviour in videos has gradually shifted from manual analysis to analysis by deep learning algorithms. Dual flow networks are commonly used in the field of behaviour analysis. Tang J et al. believed that fusion methods were crucial for the performance of dual flow networks. Therefore, a new time fusion structure was proposed to fuse the information of dual flow joints for feature recognition and prediction of human motion. The structure included a time connection block and a reinforcement trajectory spatiotemporal block, which could increase the time continuity between the first predicted pose and the given pose, as well as between each predicted pose. The short-term and long-term predictions on the H3.6M, CMU-Mocap, and 3DPW benchmark datasets achieved better results than the improved model. 5 To improve the feature fusion structure of the dual stream algorithm, A X Z et al. proposed an end-to-end refined dual stream fast region CNN algorithm. Based on the fast region CNN algorithm, RGB image features and depth image features were first extracted separately. Then, regions of interest were generated by mapping the two types of image features. Subsequently, feature extraction and fusion were carried out using a feature fusion layer. The average recognition accuracy of this model was around 95%, which was higher than that of related methods. 6 Peng C et al. proposed a three-stream model using two different types of deep CNNs and two optical flow estimation methods: the EpicFlow (edge-preserving interpolation of correspondences for optical flow) method, which emphasizes motion boundaries, and a learned optical flow estimation method. Accurate optical flow was crucial in human motion recognition. If some measures of the global behaviour of motion were included, generalization performance typically improved by 1% to 2%. 7
Sentence semantic matching, as a core task of natural language understanding technology, widely exists in various interaction scenarios. Lu W et al. designed a novel human-machine language interaction method based on a 3D CNN module. Given a pair of sentences, their representations were connected together and fed into another 3D CNN module to capture the interaction features between them, thereby generating the final matching representation. Numerous experiments conducted on two real datasets showed that this model could achieve performance comparable to or even better than state-of-the-art competitive methods. 8 Wang X et al. believed that the purpose of sentiment analysis was to identify the emotional polarity of words. While existing research has predominantly employed CNN, issues such as truncation backpropagation frequently arise during training. Therefore, Wang X et al. put forward a model combining multiple attention mechanisms (AMs) for hierarchical sentiment analysis. The model used multi-layer AM, including intra-layer AM and inter-layer AM, to capture the hidden meanings of sentences. In inter-layer AM, global attention was used to capture the interaction information between context and aspect words. Compared with the baseline model, this new method achieved the best results. 9 Gan C et al. posited that the inherent complexities of natural language, including its diverse semantics, multifaceted emotional polarities, and other intrinsic factors, continue to present significant obstacles to the advancement of emotional analysis methods. Therefore, Gan C et al. put forward an extended joint architecture of scalable multi-channel CNN and bidirectional long short-term memory (CNN-BiLSTM), combined with AM, to analyze emotional orientation. In addition, this algorithm introduced a loss function that could avoid the effects of class imbalance during training. This model's accuracy and Macro-F1 on one corpus were improved by more than 1.19% and 0.9%, respectively. Its accuracy and F1 on another corpus were improved by more than 1.7% and 1.214%, respectively. 10
In summary, experts are constantly researching topics related to behaviour recognition and text classification. They have proposed behaviour recognition and text recognition models based on CNN and dual flow networks and have continuously improved their feature fusion modules. However, CNN-based behaviour recognition models still tend to lose behaviour data, and current text recognition and classification models still have difficulty handling complex and lengthy sentences. Therefore, it is worth researching how to design a model that can recognize complex text in short videos and accurately recognize human behaviour in videos.
Emotional communication in short video based on 3D CNN and MSA
In visual analysis of video data, the recognition of emotions in videos mainly includes human motion recognition within the video and emotion recognition of short video text content. 11 At present, the main CNN models used have limited processing capabilities for video data. When processing behavioural feature recognition, information loss may occur due to the use of an average sampling strategy to convert video data into image frames, resulting in insufficient recognition accuracy and limitations in processing complex text. 12 To address these issues, a new text recognition emotion classification model and a human behaviour recognition classification model are designed based on CNN models.
Design of a text emotion recognition model based on feature fusion strategy
To achieve higher accuracy in short video text relationship classification and extract important feature information, as many local features as possible need to be applied in global relationship classification. CNN, as an application of deep learning, is essentially a process of finding the optimal weight values, which can be described by Equation (1).
When the model classifies text, it is first necessary to use Equation (4) to represent this sentence.
Because the text length is not fixed, the most important feature information of each feature layer is captured through the maximum pooling operation. By taking the global maximum, the two-dimensional data of the feature layer is transformed into a one-dimensional vector whose length equals the number of channels in the feature layer, so that each channel retains only its maximum value. This process is described by Equation (7).
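The pooling step described above can be illustrated with a minimal sketch. The function name `global_max_pool` and the sample values are hypothetical and chosen only for illustration; the feature map is assumed to be stored as one row per text position and one column per channel.

```python
def global_max_pool(feature_map):
    """Collapse a (positions x channels) feature map to one value per channel.

    feature_map is a list of rows; each row holds the channel activations at
    one text position. The number of rows may differ between sentences.
    """
    if not feature_map:
        raise ValueError("feature_map must contain at least one position")
    # zip(*rows) regroups the values by channel, so max() runs per channel.
    return [max(channel) for channel in zip(*feature_map)]


# Two sentences of different length, both with 3 channels (illustrative values).
short_sentence = [[0.1, 0.9, 0.3],
                  [0.4, 0.2, 0.8]]
long_sentence = [[0.5, 0.1, 0.0],
                 [0.2, 0.7, 0.6],
                 [0.9, 0.3, 0.4],
                 [0.1, 0.2, 0.1]]

print(global_max_pool(short_sentence))  # [0.4, 0.9, 0.8]
print(global_max_pool(long_sentence))   # [0.9, 0.7, 0.6]
```

Both outputs have length 3, the number of channels, regardless of how many positions the sentence contributes. This is why global max pooling suits text of variable length.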
Although the emotional text feature information in sentences can be captured through maximum pooling operation, the feature information that can be extracted by a single convolutional layer is limited. To enhance the quantity of relationship extraction, it is essential to consider not only augmenting the volume of feature information following convolution but also incorporating the feature information of the training word vector into the convolutional network. This is because the training of word vectors is based on a text convolutional network model, which is rich in semantic information. Figure 1 shows the improved model structure.

Figure 1. Improved text convolutional network model.
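One possible reading of the feature fusion idea in this subsection is sketched below: the word vectors of a sentence are averaged, passed through ReLU, and joined with the pooled convolutional features. The fusion operator is not recoverable from the text, so concatenation is an assumption of this sketch, and all function names and values are hypothetical.

```python
def mean_word_vector(word_vectors):
    """Average the word vectors of one sentence, dimension by dimension."""
    count = len(word_vectors)
    return [sum(dim) / count for dim in zip(*word_vectors)]


def relu(vector):
    """Element-wise ReLU: negative entries become 0."""
    return [max(0.0, v) for v in vector]


def fuse_features(pooled_conv_features, word_vectors):
    """Join pooled convolutional features with the activated mean word vector.

    Concatenation is an assumption made for this sketch; the source does not
    state the exact fusion operator.
    """
    return list(pooled_conv_features) + relu(mean_word_vector(word_vectors))


words = [[1.0, -2.0], [3.0, -4.0]]   # two words, 2-dimensional embeddings
pooled = [0.4, 0.9, 0.8]             # e.g. output of global max pooling
print(fuse_features(pooled, words))  # [0.4, 0.9, 0.8, 2.0, 0.0]
```

The fused vector carries both the convolutional features and a sentence-level summary of the word vectors, which is the extra semantic information the improved model aims to exploit.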
Compared with the traditional text sentiment classification convolutional network model, the new model improves the feature fusion module. Firstly, the words in a sentence that have already been represented as vectors are averaged to obtain an average vector that can represent all words. Then, after activating the ReLU function,
When evaluating the performance of the modified model, accuracy, recall rate, and F-value are introduced as indicators. Because accuracy and recall rate are contradictory, an F-value is needed to consider both comprehensively in model detection, as represented by Equation (11). 14
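As a minimal sketch of how such an F-value combines the two indicators, the code below uses the standard harmonic mean F = 2PR / (P + R). It assumes that "accuracy" in this passage is the precision of the detected samples and that the F-value is the balanced F1 form; the function name and the counts are hypothetical.

```python
def precision_recall_f(true_positive, false_positive, false_negative):
    """Return (precision, recall, F-value) from raw detection counts.

    The F-value is the harmonic mean of precision and recall (F1). Using F1
    rather than a weighted F-beta is an assumption of this sketch.
    """
    predicted = true_positive + false_positive   # everything the model flagged
    actual = true_positive + false_negative      # everything that should be flagged
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


p, r, f = precision_recall_f(true_positive=60, false_positive=20, false_negative=40)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # 0.750 0.600 0.667
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot obtain a high F-value by maximizing only one indicator.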
The principle of capturing human motion information in a three-dimensional convolutional model is to convolve the volume formed by stacking multiple channels with a three-dimensional convolutional kernel. This enables the feature maps in the convolutional layer to connect to adjacent frames in the previous layer and thus obtain the human motion status. The 3D convolution process is described by Equation (12). 15
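The following is a minimal sketch of a 3D convolution over a short clip, written with plain nested lists so that the summation over time, height, and width is explicit. It assumes a single channel, stride 1, no padding, and no bias or activation; the function name `conv3d_single` and the toy data are hypothetical.

```python
def conv3d_single(volume, kernel):
    """Valid 3D convolution of one single-channel volume with one kernel.

    volume[t][y][x] and kernel[dt][dy][dx] are nested lists. As in most deep
    learning libraries, the kernel is not flipped (cross-correlation).
    """
    frames, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(frames - kt + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                acc = 0.0
                # Sum over the kernel's temporal and spatial extent.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# 3 frames of 3x3 pixels; the frame index is used as the pixel value.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(3)]
# Temporal difference kernel: -1 on the earlier frame, +1 on the later frame.
kernel = [[[-1.0]], [[1.0]]]
result = conv3d_single(clip, kernel)
print(len(result), len(result[0]), len(result[0][0]))  # 2 3 3
print(result[0][0][0])  # 1.0
```

Because the kernel spans two adjacent frames, each output value depends on how the pixel changes over time, which is the property that lets a 3D model capture motion rather than only appearance.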

Figure 2. Structure of human behaviour recognition model.
This study considers that traditional AM structures can cause a heavy computational burden when processing large images due to the excessive number of nodes in the image. MSA 16 is used in this study to address this issue. The advantage of MSA is that it can improve model performance. By simultaneously focusing on multiple parts, the model can better understand the key information in the input data, thereby improving the accuracy and robustness of the model. MSA uses multiple query vectors
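For orientation, a minimal sketch of generic multi-head self-attention with scaled dot-product heads is given below. It follows the standard formulation and is not the exact configuration of this study; the sizes, the random weights, and all function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project x to queries, keys, values; return output and weights."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads, w_o):
    """Run every head on the same input, concatenate, then mix with w_o."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    concat = [[v for out in outputs for v in out[i]] for i in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, num_heads = 4, 6, 2   # hypothetical sizes
d_head = d_model // num_heads
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_head, d_model, rng)

y = multi_head_self_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 4 6
```

Each head has its own query, key, and value projections, so different heads can attend to different parts of the input at the same time. The outputs are concatenated and mixed by a final projection.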
Figure 3 shows the structure of the information fusion module, in which fusion operations are performed on spatiotemporal features and spatiotemporal scores. After passing through the spatiotemporal feature extraction module and the MSA module, the short video segments generate feature vectors. The information fusion module performs score fusion on the feature vectors of the different streams to predict the classification. The fused result is then fed into the softmax classifier module to generate prediction scores. Finally, the global spatiotemporal prediction scores are fused to obtain the final prediction of the human motion state.

Figure 3. Structure of the information fusion module.
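A minimal sketch of score fusion followed by a softmax classifier is shown below. The text does not specify the fusion rule, so simple averaging of the per-stream class scores is an assumption of this sketch, and the function names and score values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax that turns class scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(stream_scores):
    """Average class scores across streams, then apply softmax.

    Averaging is an assumption; a weighted sum would follow the same pattern.
    Returns the class probabilities and the index of the predicted class.
    """
    num_streams = len(stream_scores)
    fused = [sum(per_class) / num_streams for per_class in zip(*stream_scores)]
    probs = softmax(fused)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted


scores = [[2.0, 0.5, 0.1],   # hypothetical class scores from one stream
          [1.0, 1.5, 0.2]]   # and from another stream
probs, label = fuse_and_classify(scores)
print(label)  # 0
```

Here the second stream alone would favour class 1, but the fused score favours class 0, which illustrates how fusion lets the streams correct one another.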
The word vectors required come from Word2vec, an open-source word vector training library from Google that converts a given word in a sentence into a vector representation. 17 The UCF101 and HMDB51 databases are used in the experiment, and each input video includes Sc fragments, with each fragment containing Im frames. 18 When a video has fewer than 90 frames, the last frame is copied until the video reaches 90 frames. The AM time step parameter is 10. The evaluation indicators are accuracy, recall rate, and the harmonic average based on the F-Score principle. Accuracy reported as UCF101 (%) is measured on the UCF101 database, and accuracy reported as HMDB51 (%) is measured on the HMDB51 database.
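The frame-padding rule above can be sketched as follows. The function name `pad_to_length` is hypothetical, and leaving clips that already have 90 or more frames unchanged is an assumption, since the text only covers the shorter case.

```python
def pad_to_length(frames, target_length=90):
    """Repeat the last frame until the clip has target_length frames.

    Clips that are already long enough are returned unchanged (an assumption;
    the source only describes padding of shorter clips).
    """
    if not frames:
        raise ValueError("cannot pad an empty clip")
    missing = target_length - len(frames)
    if missing <= 0:
        return list(frames)
    return list(frames) + [frames[-1]] * missing


clip = [f"frame_{i}" for i in range(87)]   # placeholder frames
padded = pad_to_length(clip)
print(len(padded), padded[-1])  # 90 frame_86
```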
Performance simulation results and analysis of short video text sentiment classification model
For a sentence x with a given m words, using word vector technology, the sentence can be represented as: each word is represented as a real value vector
Table 1. Key parameter settings.
Figure 4 shows how the accuracy of the original text CNN and the improved text CNN in text extraction changes when the word vector dimension is 60. Accuracy is used as an evaluation indicator to represent the model's detection accuracy for samples that have already been detected, considering the influence of positional factors. From Figure 4, the accuracy of both the new and old models decreases as the number of detected results increases. The improved model has an accuracy of about 75%, while the average accuracy of the original model is only about 67%. After detecting 300 samples, the accuracy of the improved model remains at 73.2%, which is more than 10 percentage points higher than that of the original model.

Figure 4. Trend of precision changes.
Figure 5 shows how the recall rates of the new and old models change during the experiment. As the number of detected results increases, the overall recall rate of both models increases. However, when the number of detected results exceeds 250, the recall rate of the original model falls below its peak of 76.3%. In comparison, the improved model remains stable in the experiment, with an average recall rate of 85.2%.

Figure 5. Recall rate of old and new models.
Figure 6 shows the F-Score values of the new and old models, which trade off average accuracy against average recall. The average F-value of the improved model is about 6 percentage points higher than that of the original model, because the feature fusion strategy introduces more abundant short video text information. The improved convolutional network model for short video text recognition therefore outperforms the traditional model.

Figure 6. Comparison of F-values between new and old models under different sample sizes.
The designed model is tested for human motion recognition emotion classification performance under different numbers of video clips and image frames. From the data listed in Table 2, the system has the lowest recognition accuracy on the UCF101 and HMDB51 databases when the video clip number is 4 and the image frame number is 5, at 71.6% and 49.2%, respectively. When the video clip number is 5 and the image frame number is 20, the system has the highest recognition accuracy on the UCF101 database, at 94.8%. When the video clip number is 6 and the image frame number is 15, the system has the highest recognition accuracy on the HMDB51 database, at 71.5%.
Table 2. Model recognition accuracy data for different values of Sc and Im.
Figure 7 shows how the recognition accuracy of the model changes as Im increases from 5 to 20 when Sc is 4, 5, and 6. From Figure 7, as the image frame number increases, the model's overall recognition accuracy on both databases increases. When the image frame number is 5 or 10, the overall recognition accuracy of the model on the UCF101 database increases by about 10 percentage points as the video clip number increases, while the recognition accuracy on the HMDB51 database increases by 6 to 10 percentage points. When the image frame number is high, ranging from 15 to 20, increasing the video clip number has only a slight effect on the model's recognition accuracy. Although the overall trend is still upward, the gain is small, with recognition accuracy increasing by 0.1 to 3 percentage points on both databases.

Figure 7. Trend of recognition accuracy change of the model.
To compare the recognition accuracy of the model under different AMs, the model is tested with 6 video clips and 20 image frames. The recognition accuracy on the UCF101 and HMDB51 databases is measured with the SE block (Squeeze and Excitation), with CBAM (Convolutional block attention module), with MSA, and without AM. Figure 8 shows the recognition accuracy of the model under the different AMs. From Figure 8, the model with MSA has the highest recognition accuracy, reaching 94.5% on the UCF101 database, which is 12 to 39 percentage points higher than the models with other AMs. Its recognition accuracy on the HMDB51 database is 71.6%, which is 1 to 19 percentage points higher than the models with other AMs.

Figure 8. Recognition accuracy of the model under different attention mechanisms.
Table 3. Identification accuracy data of four algorithms.
To determine whether the model based on 3DCNN and incorporating AM has improved performance, the model designed in this study is compared with commonly used behaviour recognition methods. These include the improved dense trajectories (IDT) algorithm, two-stream convolutional networks, and a 2D CNN (Temporal Segment Networks, TSN). The average recognition accuracy of the different methods under different image frame numbers is compared in the experiment. Table 3 shows the recognition accuracy data of the four algorithms. When the image frame number is 20, the average recognition accuracy of the model designed in this study is the highest, at 94.3% on the UCF101 database. On the HMDB51 database, the highest average recognition accuracy of this model is 71.5%.
Figure 9 shows how the recognition accuracy of the four algorithms changes. On the UCF101 database, the overall recognition accuracy of the designed model is higher than that of the other behaviour recognition algorithms. When the image frame number is 20, its average recognition accuracy is 94.3%, which is 0.7 to 6 percentage points higher than the other algorithms. On the HMDB51 database, the recognition accuracy of the 2D convolutional network at low image frame numbers is about 9 percentage points higher than that of the model designed in this study, which is 63.6%. However, the recognition accuracy of the model designed in this study increases as the image frame number increases. When the image frame number is 20, its recognition accuracy is 71.5%, which is 3 to 14 percentage points higher than the other algorithms. Moreover, its average recognition accuracy is 65.6%, which is generally better than the other algorithms.

Figure 9. Trends in recognition accuracy of four algorithms.
In the text recognition and behaviour recognition tasks of short video sentiment analysis, traditional CNN is limited in processing complex text information and is prone to information loss when processing behaviour data. In this study, based on the CNN model, a new text recognition emotion classification model was designed by improving the text feature fusion module, and a new human behaviour recognition emotion classification model was designed by introducing MSA into the 3DCNN model. The results confirmed that the accuracy of the improved text recognition model was about 75%, while the average accuracy of the original convolutional model was only about 67%. After detecting 300 samples, the accuracy of the improved model remained at 73.2%, which was more than 10 percentage points higher than that of the original model. The average F-value of the improved model was about 6 percentage points higher than that of the original model. Compared with recognition models using other AMs, the 3D convolutional model with MSA had the highest recognition accuracy, reaching 94.5% on the UCF101 database, which was 12 to 39 percentage points higher than the models with other AMs. Its recognition accuracy on the HMDB51 database was 71.6%, which was 1 to 19 percentage points higher than the models with other AMs. On the UCF101 database, the overall recognition accuracy of the designed model was higher than that of the other behaviour recognition algorithms. Its average recognition accuracy was 94.3% when the image frame number was 20, which was 0.7 to 6 percentage points higher than the other algorithms. However, the selection of video databases in this study is not diverse enough, and the study does not consider the fuzzy videos that exist in real life. These limitations are directions for future study.
Statements and declarations
Funding
The research is supported by: General research topics of preschool education in Shandong province “A study on parents’ park-choosing intention in Shandong province based on TPB theory” Periodic results, Project number: 2022XQJY047.
Conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
