Sports action recognition algorithm based on multi-modal data recognition

Abstract

Recognizing sports actions is an important research topic because it supports the improvement of athletes’ performance. To improve the accuracy of action recognition on multi-modal data, this study builds on the Transformer module, introduces a multi-head attention mechanism, fuses multi-modal data, and constructs a multi-stream structured object relationship inference network. In addition, based on the PointNet++ network and five different data fusion frameworks, an action recognition model that integrates RGB data and 3D skeleton point clouds is constructed. With a small amount of data, the Top-1 accuracy of the multi-stream structured object relationship inference network was 42.5% and 42.7%, respectively, higher than that of the compared algorithms, which shows its advantage in complex action recognition tasks. The Top-1 and Top-5 accuracy of the multi-modal fusion model was 15.6 and 5.1 percentage points higher than that of the single-modal model, and 5.4 and 2.6 percentage points higher than that of the dual-modal model, indicating that fusing multi-modal data provides richer information and thereby improves recognition accuracy. The accuracy of the action recognition model combining RGB data and 3D skeleton point clouds was 84.3%, 87.5%, 90.2%, 90.6% and 91.2% under the different fusion strategies, which effectively compensated for the missing information in 3D skeleton point clouds and gave higher accuracy than the compared algorithms. This study can meet the needs of action recognition in different scenarios and has reference value.

Keywords

Multi-modal data; action recognition; Transformer; RGB data; 3D skeleton point cloud

1. Introduction

In sports, scientific and accurate analysis of athletes’ technical movements is key to improving training outcomes, preventing injuries and improving competition results. However, traditional motion analysis methods often rely on manual observation and subjective judgment, which makes them inefficient and inaccurate [1]. With the rapid development of computer vision and artificial intelligence, sports action recognition algorithms have emerged. These algorithms automatically extract athletes’ movement features from video or sensor data and then classify and recognize them, enabling objective and accurate evaluation of technical movements. Sports action recognition algorithms also have broad application prospects [2]. In intelligent fitness, recognizing the user’s movements makes it possible to provide personalized fitness guidance and exercise prescriptions. In the live broadcast and replay of sports events, key actions can be automatically identified and marked to improve the spectator experience. In training and competition, accurate action recognition and analysis can help athletes refine their technical movements and raise their competitive level [3]. Research on sports action recognition algorithms therefore has both theoretical significance and application value, and is a current research hotspot in sports science and technology [4]. Multi-modal data refers to different forms of data collected from the same object or scene at the same time, such as images, video, sound, text and sensor data. These data not only provide complementary information but also, through the correlations between them, improve the understanding of the target object or scene. Multi-modal data is widely used in sports [5].
For example, in basketball, fusing video data with sensor data from the players allows shooting, running and jumping to be accurately identified, which gives coaches a basis for scientific training recommendations. In football matches, combining image data with players’ physiological data makes it possible to monitor fatigue and movement state in real time and helps coaches plan substitutions. Fusing multi-modal data can improve not only the accuracy of action recognition but also the robustness of the system: when the data of one modality is disturbed or missing, the other modalities can provide supplementary information so that the recognition task can still be completed. Multi-modal data can also provide more contextual information to help understand the semantics and intent of actions. Therefore, this study builds on the Transformer module, integrates multi-modal data, and constructs a multi-stream structured object relationship inference network (MSORIN). Based on the PointNet++ network and five different data fusion frameworks, an action recognition model integrating RGB data and 3D skeleton point clouds is then constructed to meet the requirements of dynamic action recognition.

This paper is organized in four parts. The first part reviews research by domestic and foreign scholars on action recognition algorithms based on RGB data and skeleton data. The second part constructs the MSORIN and an action recognition model that integrates RGB data and 3D skeleton point clouds. The third part tests and analyzes the models, and the fourth part summarizes the study and discusses its shortcomings.

2. Related works

RGB data is commonly used for action recognition, and several studies have examined its application. To capture the spatio-temporal information of human actions, Li S et al. constructed a two-stream human action recognition model based on RGB images and optical flow. The model determined the starting point in the video by calculating joint positions from the skeleton data, and then used RGB images and optical flow to train spatial and temporal networks to predict actions. The outputs were fused by averaging. The experimental results demonstrated the effectiveness of this method [6]. Li X et al. proposed a Transformer-based RGB-D egocentric action recognition framework that used a self-attention mechanism (SAM) to model the temporal structure of data from different modalities. The features from each modality interacted through the proposed fusion blocks and were combined through simple and effective fusion operations. Tests on a small first-person dataset showed that the framework outperformed existing models [7]. Cheng J et al. found that existing action recognition models based on RGB data could not realize interaction between the information of different modalities, so they proposed a cross-modal compensation convolutional neural network for action recognition. The network jointly learned compensation features of the RGB and depth modalities to enhance discrimination ability, extracting them through cross-modal compensation blocks, which improved the performance of the model. The model achieved the best performance on all three datasets tested [8]. Weiyao X et al. found that existing action recognition methods based on RGB data overlooked semantic relationships.
Therefore, they constructed a multi-modal action recognition model based on bilinear pooling and attention networks. An efficient data preprocessing method was first applied to the RGB and skeleton data, and a multi-modal fusion network was then built by combining RGB video and skeleton sequences. The experimental results indicated that the model could effectively extract features from RGB and skeleton data and performed better than other methods [9].

Skeleton data, as the foundation of 3D action recognition algorithms, can provide more comprehensive information for action recognition, and several studies have built on it. Yangzhi L I et al. found that existing human skeleton action recognition algorithms could not fully mine the spatio-temporal characteristics of actions, so they proposed a skeleton action recognition model based on a spatio-temporal attention graph convolutional network. The algorithm consisted of spatial and temporal attention mechanisms, used the instantaneous motion information of optical flow features to locate spatial regions with significant motion, and automatically extracted temporal segments from long, complex videos. The results indicated that the algorithm improved recognition accuracy compared with existing skeleton recognition algorithms [10]. Ahmad T et al. found that graph convolutional neural networks were the most suitable networks for skeleton recognition. However, stacking human skeletons along the length of a video sequence produced a graph with many nodes and high complexity. Therefore, a graph sparsification technique using edge effective resistance was proposed to better model global contextual information. Combined with self-attention graph pooling, it preserved locally useful information and eliminated redundant nodes and edges in the graph. The experimental findings showed that this method retained useful information, removed the influence of noise, and improved test accuracy [11]. Zhao J et al. proposed a lightweight and efficient method to address the low efficiency of most skeleton-based algorithms. The method embedded the graph convolution operator into a simple recurrent unit to construct a graph convolution simple recurrent unit.
At the same time, to enhance the distinction between nodes, a spatial attention network and multi-stream data fusion were used to extend the graph convolution simple recurrent unit into a multi-stream spatial attention graph convolution simple recurrent unit. The test accuracy of the model on three datasets was 93.1%, 92.7%, and 87.3%, respectively [12]. Kawamura K et al. noted that skeleton data must be measured by depth sensors or extracted from video data with estimation algorithms, which introduces extraction errors. To achieve skeleton-based action recognition, they proposed a deep state space model, which is a deep generative model of the latent dynamics of observable sequences. The results showed that this method improved the classification performance of the baseline method and was superior to the most advanced methods [13].

In summary, most existing action recognition methods study RGB data and skeleton data separately, and research on algorithms that fuse the two is lacking. Therefore, this study combines RGB data and skeleton data as multi-modal input, and the resulting action recognition method has reference value.

3. Construction of a sports action recognition model based on multi-modal data recognition

To better recognize human actions, this section constructs the action recognition model in two parts. The first part builds on the Transformer module and introduces a multi-head attention mechanism (MHAM); the inflated 3D (I3D) network and an optical flow algorithm are used to extract appearance representation data and optical flow data, which are integrated to construct the MSORIN. In the second part, based on the PointNet++ network, five fusion frameworks for 3D skeleton point clouds and RGB data are designed to improve the accuracy of action recognition.

3.1 Construction of an MSORIN model based on multi-modal fusion

Sports action recognition technology is mainly used to capture, analyze and understand human actions in sports, and has applications in many fields, including athlete training, sports medicine, fitness and games. Compared with traditional mechanical, acoustic and electromagnetic motion capture technologies, modern sports action recognition technology offers higher precision and better real-time performance. Traditional motion capture requires sensors or markers to be attached to the body, which interferes with natural movement and complicates data processing. Action recognition based on computer vision captures human actions through a camera without any equipment on the body, so it is more convenient and practical. Multi-modal data contains multiple types of information, usually in different forms, structures or representations; methods such as cross-modal learning, multi-modal feature extraction and fusion combine these types to obtain more comprehensive and accurate information. To better recognize the relationships between objects in action recognition and to distinguish human behavior from objects, a Transformer-based target relationship inference network (Relation_Transformer) is constructed. The network structure is shown in Fig. 1.

Figure 1.

Structure diagram of Relation_Transformer.

The network is divided into three stages: multi-modal feature extraction, object relationship inference, and multi-modal fusion. In the multi-modal feature extraction stage, object detection and multi-object tracking algorithms are used to obtain the geometric position coordinates, RGB data, and optical flow data of the target objects [14]. A detected target is denoted $O_{t,k}$, and its absolute position is given by the bounding box coordinates $\left[X_{t,k},Y_{t,k},w_{t,k},h_{t,k}\right]$. With $K$ targets in total, each target has $K-1$ relative distances to the other objects. The method for calculating the relative distance is shown in Eq. (1).

$\displaystyle\Delta d_{1}=\sqrt{(X_{t,k}-X_{t,1})^{2}+(Y_{t,k}-Y_{t,1})^{2}}$ (1)

In Eq. (1), $X_{t,k}$ and $Y_{t,k}$ represent the bounding box coordinates of the $k$-th target $O_{t,k}$, and $\Delta d_{1}$ is its distance to object 1. The temporal position change of the target between adjacent frames is calculated as shown in Eq. (2).

$\displaystyle\left\{\begin{array}{l}\Delta X_{i,k}=\sqrt{(X_{i,k}-X_{i-1,k})^{2}}\\ \Delta Y_{i,k}=\sqrt{(Y_{i,k}-Y_{i-1,k})^{2}}\end{array}\right.$ (2)

In Eq. (2), $X_{i,k}$ and $X_{i-1,k}$, and $Y_{i,k}$ and $Y_{i-1,k}$, are the coordinates of the target in adjacent frames. The geometric representation feature $[X_{t,k},Y_{t,k},w_{t,k},h_{t,k},\Delta X_{t,k},\Delta Y_{t,k},\Delta d_{1},\Delta d_{2},\cdots,\Delta d_{K-1}]$ of the target $O_{t,k}$ is thus obtained, where $\Delta X_{t,k}$ and $\Delta Y_{t,k}$ are Eq. (2) evaluated at frame $i=t$. This feature is passed through a fully connected layer, a batch normalization layer, and a rectified linear unit (ReLU) to obtain the one-dimensional geometric feature vector of the target $O_{t,k}$. For the RGB data, the two-stream I3D network is used to extract the appearance representation features of the RGB sequences [15]. The output feature map $\textit{FM}\in{\mathbb{R}}^{T/2\times H\times W\times C}$ of the Mixed_4f convolutional layer of the I3D network is taken as the spatio-temporal feature map. The spatio-temporal feature map and the bounding box coordinates are input into a pooling layer to obtain the appearance representation map $\textit{FO}_{t,k}\in{\mathbb{R}}^{5\times 5\times C}$ of each target. To obtain the one-dimensional feature vector used by Relation_Transformer, $\textit{FO}_{t,k}$ is passed through an average pooling layer, a fully connected layer, a batch normalization layer, and a ReLU, which gives the one-dimensional appearance representation vector [16]. For the optical flow data, the total variation L1 (TVL1) algorithm is used to compute the optical flow sequence corresponding to the RGB video sequence, from which the motion representation features are obtained. The extracted motion representation features are input into the pooling layer of the I3D network to obtain a one-dimensional motion feature vector.
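As an illustration of Eqs. (1) and (2), the geometric feature of one target can be assembled as in the following minimal sketch. The function name `geometric_features` and the tuple layout `(X, Y, w, h)` are assumptions made for this example only; the fully connected, batch normalization and ReLU layers that follow in the network are not shown.

```python
import math


def geometric_features(boxes_t, boxes_prev, k):
    """Build the raw geometric feature of target k at frame t.

    boxes_t and boxes_prev are lists of (X, Y, w, h) tuples for the same
    targets at frame t and frame t-1 (hypothetical layout for this sketch).
    """
    X, Y, w, h = boxes_t[k]
    X_prev, Y_prev, _, _ = boxes_prev[k]

    # Eq. (2): position change between adjacent frames (equals |difference|).
    dX = math.sqrt((X - X_prev) ** 2)
    dY = math.sqrt((Y - Y_prev) ** 2)

    # Eq. (1): Euclidean distance from target k to every other object.
    dists = [
        math.hypot(X - Xo, Y - Yo)
        for j, (Xo, Yo, _, _) in enumerate(boxes_t)
        if j != k
    ]
    return [X, Y, w, h, dX, dY] + dists


# Three tracked boxes in two consecutive frames (made-up values).
boxes_prev = [(0, 0, 2, 2), (3, 4, 1, 1), (6, 8, 1, 1)]
boxes_t = [(1, 0, 2, 2), (4, 4, 1, 1), (1, 5, 1, 1)]
feature = geometric_features(boxes_t, boxes_prev, 0)
```

With $K$ tracked objects the returned list has $4+2+(K-1)$ entries, matching the length of the feature vector described above.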
The Transformer is a neural network model based on SAM that has attracted widespread attention for its modeling ability and parallel computing performance, and it is increasingly applied in fields such as image processing, speech recognition, and action recognition [17]. Relation_Transformer comprises the MHAM, a fully connected feedforward neural network (FNN), a position encoding mechanism and mask design rules. The method for calculating the SAM is shown in Eq. (3).

$\displaystyle A(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$ (3)

In Eq. (3), $A(Q,K,V)$ represents the output of the attention calculation. $d_{k}$ denotes the dimension of the query and key vectors, which are obtained by passing the one-dimensional geometric feature vectors through a linear layer; dividing by $\sqrt{d_{k}}$ keeps the dot products from growing so large that the softmax saturates and its gradients become very small. $Q$ is the matrix formed by stacking the query vectors, while $K$ and $V$ are the matrices formed by stacking the key and value vectors. The schematic diagram of SAM is shown in Fig. 2.

Figure 2.

SAM diagram.
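The attention calculation of Eq. (3) can be sketched in a few lines of plain Python. This is only a minimal illustration on nested lists; the helper names `softmax` and `attention` are assumptions for this example, and a practical implementation would use a tensor library and batched matrices.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3): softmax(QK^T / sqrt(d_k)) V.

    Q, K and V are lists of row vectors; K and V have the same number of rows.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # One scaled score per key vector.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output


# A zero query scores every key equally, so the output is the mean of V.
result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Each output row is a convex combination of the value vectors, with weights given by the softmax of the scaled query-key scores.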

To enhance the network’s ability to infer target relationships, the MHAM is used, and its calculation is shown in Eq. (4).

$\displaystyle\left\{\begin{array}{l}Q_{i}=YW_{i}^{Q}\\ K_{i}=YW_{i}^{K}\\ V_{i}=YW_{i}^{V}\\ h_{i}=A(Q_{i},K_{i},V_{i})\end{array}\right.$ (4)

In Eq. (4), $Y$ is the input sequence obtained by arranging the one-dimensional geometric feature vectors in space and time. $(W_{i}^{Q},W_{i}^{K},W_{i}^{V})$ is the set of trainable linear layers of head $i$. $h_{i}$ is the output of head $i$, obtained as the weighted sum of the value vectors with the corresponding attention scores $a_{i,j}$. The output of the MHAM is shown in Eq. (5).

$\displaystyle A_{\textit{mh}}(Q,K,V)=\textit{concat}(h_{1},\cdots,h_{i},\cdots,h_{m})W^{o}$ (5)

In Eq. (5), $W^{o}\in{\mathbb{R}}^{md_{v}\times d}$ is the parameter matrix of the output linear mapping layer. To prevent the degradation caused by too many layers in the Transformer network, the network structure adopts skip connections [18]. Layer normalization is applied to accelerate convergence, and Dropout is used to prevent overfitting. The MHAM layer after these operations is shown in Eq. (6).

$\displaystyle\hat{E}(Y)=\textit{Layer Norm}(Y+\textit{Dropout}(A_{mh}(Y)))$ (6)

In Eq. (6), Layer Norm denotes layer normalization of the input data. The FNN is applied to the output of the preceding MHAM layer, and its expression is shown in Eq. (7).

$\displaystyle L(X)=\max(0,XW_{1}+b_{1})W_{2}+b_{2}$ (7)

In Eq. (7), $W_{1}$ and $W_{2}$ represent the two fully connected weight matrices, and $b_{1}$ and $b_{2}$ are the corresponding bias vectors. As in the MHAM layer, normalization, Dropout and skip connection operations are applied to obtain the output of the FNN, as shown in Eq. (8).

$\displaystyle E(Y)=\textit{Layer Norm}(\hat{E}(Y)+\textit{Dropout}(L(\hat{E}(Y))))$ (8)

In Eq. (8), $Y$ is the input sequence, and $\hat{E}(Y)$ is the output of the multi-head self-attention layer. For position encoding, fixed sinusoidal position encoding is used, calculated as shown in Eq. (9).

$\displaystyle\left\{\begin{array}{l}\textit{PE}_{(\textit{pos},2i)}=\sin(\textit{pos}/10000^{2i/d})\\ \textit{PE}_{(\textit{pos},2i+1)}=\cos(\textit{pos}/10000^{2i/d})\end{array}\right.$ (9)
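Eq. (9) can be evaluated directly, as in the following minimal sketch. The function name `positional_encoding` and the nested-list output are assumptions made for illustration; `d` is the encoding dimension and `num_positions` the sequence length.

```python
import math


def positional_encoding(num_positions, d):
    """Fixed sinusoidal position encoding following Eq. (9).

    Returns a num_positions x d table; even columns use sine and odd
    columns use cosine, both with the frequency index i = column // 2.
    """
    table = [[0.0] * d for _ in range(num_positions)]
    for pos in range(num_positions):
        for col in range(d):
            i = col // 2
            angle = pos / (10000 ** (2 * i / d))
            table[pos][col] = math.sin(angle) if col % 2 == 0 else math.cos(angle)
    return table


pe = positional_encoding(4, 6)
```

Because the table depends only on the position and the dimension, it can be computed once and added to the input sequence without any trainable parameters.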

In Eq. (9), $\textit{PE}_{(\textit{pos},2i)}$ and $\textit{PE}_{(\textit{pos},2i+1)}$ are the $2i$-th and $(2i+1)$-th components of the encoding vector at sequence position pos, respectively. Sinusoidal position encoding provides the absolute position information, and the relative position information can also be derived through trigonometric identities. When the target $O_{t,k}$ disappears from the field of view, a mask marks the vanished position as False, shielding the self-attention and the fully connected FNN from the influence of invalid data. A late fusion strategy is used to fuse the Softmax outputs obtained from the different modalities [19, 20], and the calculation is shown in Eq. (10).

$\displaystyle p=\frac{\sum\limits_{n=1}^{N}{p_{n}}}{N}$ (10)

In Eq. (10), $N$ is the number of branches in Relation_Transformer, and $p_{n}$ denotes the classification probability produced by the classification layer of the $n$-th branch. With these operations, Relation_Transformer can capture the relationships and interactions between multiple types of actions, thereby improving the accuracy and stability of action recognition.
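The late fusion of Eq. (10) is a plain average of the per-branch class probabilities. The sketch below illustrates it with made-up probabilities for three branches; the function name `late_fusion` and the example values are assumptions, not outputs of the model.

```python
def late_fusion(branch_probs):
    """Average the class probabilities of N branches, Eq. (10).

    branch_probs is a list of N probability vectors of equal length.
    """
    n = len(branch_probs)
    num_classes = len(branch_probs[0])
    return [sum(p[c] for p in branch_probs) / n for c in range(num_classes)]


# Hypothetical Softmax outputs of three branches over three classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
]
fused = late_fusion(probs)
predicted_class = max(range(len(fused)), key=lambda c: fused[c])
```

Averaging keeps the fused vector a valid probability distribution whenever every branch output is one.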

3.2 Construction of an action recognition model integrating RGB data and 3D skeleton point cloud

Figure 3.

Position coordinates of joint points in two-dimensional coordinates.

Skeleton sequence stacking is a technique of stacking or combining multiple skeleton sequences, and it is widely used in action recognition to extract and represent the features of human motion. A 3D skeleton point cloud is a type of 3D point cloud data that represents the motion of a human body or object as a series of joint positions, each given by a 3D coordinate point [21]. Each joint position represents posture information at a specific moment. Connecting multiple joint positions forms a hierarchical representation that reflects the overall posture and shape of the motion. However, 3D skeleton point clouds are unordered with respect to temporal, spatial and action information [22]. Using the decimal transformation method, a three-dimensional vector $P^{T}$ is used to represent temporal information, and KinectV2 is used to obtain the two-dimensional coordinates of the joint points, which represent their spatial position information $P^{S}$, as shown in Fig. 3.

In Fig. 3, six coordinate points are selected to represent the head, hands, feet, and tailbone of the human skeleton. Because human motion is a dynamic process, a vector $P^{M}$ is used to represent motion information. The three types of information are concatenated and input into the skeleton point cloud feature extraction network to obtain global action information. PointNet++ is a deep learning network for processing point cloud data; it adopts a hierarchical structure that partitions the point cloud multiple times, from the global to the local level, and extracts features [23, 24]. Therefore, PointNet++ is adopted to identify skeleton point cloud features, and a skeleton point cloud feature extraction network is constructed on it. The framework diagram is shown in Fig. 4.

Figure 4.

Framework of skeleton point cloud feature extraction network based on PointNet++ network.
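PointNet++ builds its hierarchy by first choosing well-spread key points with farthest point sampling and then grouping the neighbours inside a ball around each key point. The following minimal sketch illustrates these two steps on a toy point set; the function names, the starting index and the radius are assumptions for this example, and the feature extraction applied to each group is omitted.

```python
import math


def farthest_point_sampling(points, num_samples, start=0):
    """Pick num_samples indices so that each new point is the one
    farthest from all points selected so far."""
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < num_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


def ball_group(points, center_idx, radius, k):
    """Return the indices of at most k points inside the ball of the
    given radius around the center, nearest first."""
    center = points[center_idx]
    inside = sorted(
        (math.dist(p, center), i)
        for i, p in enumerate(points)
        if math.dist(p, center) <= radius
    )
    return [i for _, i in inside[:k]]


# Toy 3D joint positions (made-up values).
points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 4, 0), (10, 5, 0)]
key_points = farthest_point_sampling(points, 3)
group = ball_group(points, 0, radius=4.5, k=2)
```

Using the same radius for every key point mirrors the equal-radius spherical regions described in this section.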

The traditional PointNet network is designed so that the classification result is the same regardless of the order in which the unordered point cloud is input, but it does not capture local structures. PointNet++ improves on PointNet by using a sampling, grouping and aggregation strategy to extract local features. First, farthest point sampling is used to obtain the set of key points in the skeleton point cloud. Each key point is taken as a center, and the point set is divided into multiple spherical regions of the same radius. The $K$ points closest to the center within each spherical region are treated as a separate point cloud, from which features are extracted and aggregated into the global feature information of the point cloud. Because 3D skeleton point clouds lack significant differences between actions, RGB data is fused with the 3D skeleton point clouds through five fusion frameworks: input layer, intermediate layer, point cloud guided attention, feature layer, and decision layer. In input layer fusion, the spatio-temporal position of a skeleton point is denoted $v_{t,j,m}=[x_{t,j,m},y_{t,j,m},z_{t,j,m}]$, and the RGB three-channel data is denoted $c_{t,j,m}=[r_{t,x_{t,j,m},y_{t,j,m}},g_{t,x_{t,j,m},y_{t,j,m}},b_{t,x_{t,j,m},y_{t,j,m}}]$. The skeleton point information $v_{t,j,m}$ in the 3D skeleton sequence is combined with the RGB three-channel data $c_{t,j,m}$ to obtain the fused input data for feature modeling. The data representation is shown in Eq. (11).

$\displaystyle{v}^{\prime}_{t,j,m}=[x_{t,j,m},y_{t,j,m},z_{t,j,m},r_{t,x_{t,j,m},y_{t,j,m}},g_{t,x_{t,j,m},y_{t,j,m}},b_{t,x_{t,j,m},y_{t,j,m}}]$ (11)

In Eq. (11), $(x_{t,j,m},y_{t,j,m})$ is the corresponding pixel position in the RGB video frame. In intermediate layer fusion, the I3D network is used for RGB feature extraction, and the extracted output feature map is shown in Eq. (12).

$\displaystyle f_{l}^{i3d}\in{\mathbb{R}}^{C_{l}^{i3d}\times T_{l}^{i3d}\times H_{l}^{i3d}\times W_{l}^{i3d}}$ (12)

After pooling, it is fused with the output feature $f_{l}^{\textit{SA}}\in{\mathbb{R}}^{{C_{l}^{\textit{SA}}}\times{N_{l}^{\textit{SA}}}}$ of PointNet++ to obtain the feature vector after intermediate layer fusion, as shown in Eq. (13).

$\displaystyle f_{l}^{\textit{fusion}}\in{\mathbb{R}}^{(C_{l}^{\textit{SA}}\times C_{l}^{i3d})\times N_{l}^{\textit{SA}}}$ (13)

$f_{l}^{\textit{fusion}}$ is used as the input feature for the next fusion operation until all fusion operations are completed. Intermediate layer fusion, however, ignores the importance of the spatio-temporal positions of the skeleton points. Therefore, a point cloud guided attention fusion (PGAF) module is constructed to focus on local features at different positions. In this module, the skeleton point cloud is used to guide the extraction of RGB representation features, denoted $f_{\textit{rgb}}\in{\mathbb{R}}^{C_{\textit{rgb}}\times T\times H\times W}$, and the attention mechanism is used to calculate the features of the 3D skeleton point cloud, denoted $f_{\textit{pc}}\in{\mathbb{R}}^{C_{\textit{pc}}\times W}$. The two features are then fused: the key and value vectors are obtained from $f_{\textit{rgb}}$ through convolutional layer mapping, and the query vector is obtained from $f_{\textit{pc}}$ through self-attention calculation. In feature layer fusion, the 3D skeleton point cloud features are extracted by the PointNet++ network and the RGB features by the I3D network. After fusion by concatenation or averaging, they are input into the classifier to obtain the classification probability, as shown in Eq. (14).

$\displaystyle p=C\left([h_{S},h_{C}]\ \textit{or}\ \frac{h_{S}+h_{C}}{2}\right)$ (14)

In Eq. (14), $h_{S}$ denotes the 3D skeleton point cloud features and $h_{C}$ the RGB features. $[h_{S},h_{C}]$ denotes concatenation, $\frac{h_{S}+h_{C}}{2}$ denotes averaging, and $C(\cdot)$ represents the classifier. When decision layer fusion is performed with the mean method, the calculation is shown in Eq. (15).

$\displaystyle p=\frac{p_{S}+p_{C}}{2}$ (15)

In Eq. (15), $p_{S}$ is the classification probability obtained from the 3D skeleton point cloud branch, and $p_{C}$ is that obtained from the RGB branch. The framework flowchart for integrating 3D skeleton point clouds with RGB data is shown in Fig. 5.

Figure 5.

Framework flowchart for integrating 3D skeleton point cloud and RGB data.
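The difference between feature layer fusion, Eq. (14), and decision layer fusion, Eq. (15), can be illustrated with the following minimal sketch. The function names and the example vectors are assumptions; the PointNet++ and I3D feature extractors and the classifier $C(\cdot)$ are not implemented here.

```python
def fuse_features(h_s, h_c, mode="concat"):
    """Feature layer fusion, the argument of C(.) in Eq. (14).

    h_s: skeleton point cloud features, h_c: RGB features.
    'concat' joins the two vectors; 'mean' averages them element-wise
    and therefore needs vectors of equal length.
    """
    if mode == "concat":
        return list(h_s) + list(h_c)
    if mode == "mean":
        if len(h_s) != len(h_c):
            raise ValueError("mean fusion needs features of equal length")
        return [(a + b) / 2 for a, b in zip(h_s, h_c)]
    raise ValueError("mode must be 'concat' or 'mean'")


def fuse_decisions(p_s, p_c):
    """Decision layer fusion, Eq. (15): average of the two branch
    classification probabilities."""
    return [(a + b) / 2 for a, b in zip(p_s, p_c)]


# Made-up feature vectors and branch probabilities.
joined = fuse_features([1.0, 2.0], [3.0, 4.0], mode="concat")
averaged = fuse_features([1.0, 2.0], [3.0, 4.0], mode="mean")
decision = fuse_decisions([0.8, 0.2], [0.4, 0.6])
```

Feature layer fusion passes one fused vector to a single classifier, whereas decision layer fusion keeps two classifiers and only combines their output probabilities.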

4. Action recognition model testing and analysis

To test the action recognition performance of the models, this section is divided into two parts. The first tests and analyzes the MSORIN, and the second tests and analyzes the action recognition model that integrates RGB data and 3D skeleton point clouds.

4.1 Testing and analysis of MSORIN

The experiments were run on Ubuntu 16.04 with an Intel Xeon 4114 CPU, a TITAN RTX graphics card, and a Python 3.7 software environment. Momentum was set to 0.9, weight decay to 0.0005, the initial learning rate to 0.01, and the number of epochs to 40. The I3D network, the temporal shift module (TSM), structured attention fusion for compositional action recognition (SAFCAR), and the Relation_Transformer model were tested on the Something_Else dataset. Top-1 and Top-5 accuracy were chosen as metrics to make the results distinguishable and to increase applicability. The results are shown in Fig. 6.

Figure 6.

Top-1 and Top-5 curves of different models.
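For reference, Top-1 and Top-5 accuracy count a sample as correct when its true label is among the one or five highest-scoring classes. A minimal sketch of this metric follows; the function name `top_k_accuracy` and the toy scores are assumptions and are unrelated to the reported results.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k classes
    with the highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Three samples, three classes (made-up scores).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
```

Top-k accuracy can never decrease as k grows, which is why the Top-5 values are always at least as high as the Top-1 values.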

Figure 6(a) shows the Top-1 accuracy curves of the four models; the Relation_Transformer model had the highest accuracy, 68.8%. Figure 6(b) shows the Top-5 accuracy curves of the four models; the Relation_Transformer model again had the highest accuracy, 90.2%. This suggests that the motion representation features contained in optical flow data are important information for action recognition. To test the algorithm’s spatio-temporal feature extraction ability with a small amount of data, 5 and 10 samples were selected. Top-1 accuracy was used as the indicator, and the test results are shown in Fig. 7.

Figure 7.

Top-1 accuracy curves with different amounts of data.

Figure 7(a) shows the Top-1 accuracy curves of the four models with 5 samples; the Relation_Transformer model had the highest accuracy, 42.5%. Figure 7(b) shows the Top-1 accuracy curves of the four models with 10 samples; the Relation_Transformer model again had the highest accuracy, 42.7%. Its accuracy was therefore higher than that of the other algorithms with a small amount of data, and it remained effective for action recognition. To demonstrate the benefit of multi-modal fusion for action recognition, models with single-stream, dual-stream, and multi-stream network structures were compared. The data modalities were Boxes, Boxes+RGB, and Boxes+RGB+Flow, respectively. The fusion method was tested 10 times, and the test results are shown in Fig. 8.

Figure 8 shows the Top-1 and Top-5 accuracy curves of the three network structures. The average Top-1 and Top-5 accuracy was 56.5% and 86.2% for the Boxes single-stream model, 66.7% and 88.7% for the Boxes+RGB dual-stream model, and 72.1% and 91.3% for the Boxes+RGB+Flow multi-stream model. Multi-modal fusion therefore improved accuracy. Compared with the single-modal model, the multi-modal model was 15.6 and 5.1 percentage points higher (relative gains of 27.6% and 5.9%); compared with the dual-modal model, it was 5.4 and 2.6 percentage points higher (relative gains of 8.1% and 2.9%).
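The gains can be rechecked from the average accuracies reported above. The sketch below separates the gain in percentage points from the relative gain; the function name `gains` is an assumption for this example.

```python
def gains(new, old):
    """Return (gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return absolute, relative


# Average (Top-1, Top-5) accuracy in percent, as reported for Fig. 8.
single = (56.5, 86.2)  # Boxes
dual = (66.7, 88.7)    # Boxes+RGB
multi = (72.1, 91.3)   # Boxes+RGB+Flow

vs_single = [gains(m, s) for m, s in zip(multi, single)]
vs_dual = [gains(m, d) for m, d in zip(multi, dual)]
```

Stating which of the two quantities is meant avoids ambiguity, because a gain of a few percentage points corresponds to a larger relative gain when the baseline accuracy is low.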

4.2 Testing and analysis of an action recognition model integrating RGB data and 3D skeleton point clouds

To test the action recognition model that integrates RGB data and 3D skeleton point clouds, the same experimental system was used: Ubuntu 16.04 with an Intel Xeon 4114 CPU, a TITAN RTX graphics card, and a Python 3.7 software environment. The NTU RGB+D 60 dataset was selected as the test data. Momentum was set to 0.9, the weight decay coefficient to 0.0001, the number of epochs to 60, and the initial learning rate to 0.1. The action recognition model that combines RGB data and 3D skeleton point clouds (model 1), the Temporal Segment Networks model (model 2), the Temporal Relational Reasoning in Videos action recognition model (model 3), the Temporal Shift Module action recognition model (model 4) and the SlowFast action recognition model (model 5) were compared. The metrics were accuracy, recall, relative error, running time, and F1 value. The details are shown in Table 1.

Table 1
Comparison of indicators between models

Model	Accuracy (%)	Recall (%)	Relative error	Running time (s)	F1 value (%)
Model 1	92.58	91.20	0.003	33.59	90.28
Model 2	58.93	61.24	0.011	38.25	76.94
Model 3	62.14	65.83	0.025	32.74	71.53
Model 4	65.39	69.28	0.113	41.58	69.85
Model 5	70.14	75.91	0.256	52.94	74.96

Figure 8.

Accuracy curves of models with different modes.

In Table 1, the accuracy, recall, relative error, running time, and F1 value of model 1 were 92.58%, 91.20%, 0.003, 33.59 s, and 90.28%, respectively. Model 1 had the highest accuracy, recall, and F1 value and the lowest relative error of the five models; its running time was the second shortest, behind model 3 (32.74 s). The results showed that model 1, which combines RGB data and 3D skeleton point cloud, had the best overall performance. The deep semantics-long short term memory networks (DS-LSTM), spatial-temporal graph convolutional network (ST-GCN), and biosynthetic gene cluster-long short term memory networks (BGC-LSTM) models were chosen for comparative testing with model 1. The results are shown in Fig. 9.
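The per-metric ranking in Table 1 can be checked mechanically. The sketch below copies the table values and picks the best model for each metric, assuming that relative error and running time are lower-is-better and the remaining metrics are higher-is-better. The variable and function names are illustrative.

```python
# Rows of Table 1: accuracy (%), recall (%), relative error, running time (s), F1 (%).
table1 = {
    "Model 1": (92.58, 91.20, 0.003, 33.59, 90.28),
    "Model 2": (58.93, 61.24, 0.011, 38.25, 76.94),
    "Model 3": (62.14, 65.83, 0.025, 32.74, 71.53),
    "Model 4": (65.39, 69.28, 0.113, 41.58, 69.85),
    "Model 5": (70.14, 75.91, 0.256, 52.94, 74.96),
}
metrics = ["accuracy", "recall", "relative_error", "running_time", "f1"]
lower_is_better = {"relative_error", "running_time"}  # assumption of this sketch


def best_model(metric):
    """Return the model with the best value for the given metric."""
    idx = metrics.index(metric)
    pick = min if metric in lower_is_better else max
    return pick(table1, key=lambda name: table1[name][idx])


best = {m: best_model(m) for m in metrics}
for metric, model in best.items():
    print(f"{metric:15s} -> {model}")
```

Under these assumptions, model 1 leads on every metric except running time, where model 3 is slightly faster.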

From Fig. 9, the RGB+3D model achieved a Top-1 accuracy of 81.2% and a Top-5 accuracy of 91.2%, higher than the other algorithms, indicating good performance in human action recognition. To verify that fusing temporal, spatial, and motion information is more accurate than using a single information feature, the three features were added to the baseline individually, in pairs, and all together to form different fusion models. The test results are shown in Table 2.

Table 2

Accuracy of testing different fusion information models

Fuse information	Top-1 (%)	Top-5 (%)
Baseline (3D skeleton point cloud)	73.5	80.5
3D skeleton point cloud+Time	77.6	85.1
3D skeleton point cloud+Space	76.2	84.3
3D skeleton point cloud+Motion	78.5	86.4
3D skeleton point cloud+Time+Space	79.8	86.9
3D skeleton point cloud+Time+Motion	80.5	88.4
3D skeleton point cloud+Space+Motion	81.3	90.6
3D skeleton point cloud+Time+Space+Motion	82.6	92.7

Figure 9.

Accuracy test curves for different models.
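Top-1 and Top-5 accuracy are used throughout the evaluation. As a reminder of what they measure, the sketch below computes Top-k accuracy from per-class scores: a sample counts as correct when its true label is among the k highest-scoring classes. The scores and labels are toy values for illustration and are unrelated to the reported experiments.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 3 classes.
scores = [
    [0.10, 0.70, 0.20],  # highest score: class 1
    [0.50, 0.30, 0.20],  # highest score: class 0
    [0.20, 0.30, 0.50],  # highest score: class 2
    [0.40, 0.35, 0.25],  # highest score: class 0
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Top-k accuracy can never decrease as k grows, which is why the Top-5 values in the tables are always at least as high as the Top-1 values.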

From Table 2, the baseline 3D skeleton point cloud model had the lowest recognition accuracy. After integrating the 3D skeleton point cloud with temporal, spatial, or motion feature information alone, the Top-1 and Top-5 accuracy values were 77.6% and 85.1%, 76.2% and 84.3%, and 78.5% and 86.4%, respectively. After combining the 3D skeleton point cloud with two of the features (time and space, time and motion, space and motion), the Top-1 and Top-5 accuracy values were 79.8% and 86.9%, 80.5% and 88.4%, and 81.3% and 90.6%, respectively, an improvement over the single-feature results. After fusing the 3D skeleton point cloud with all three features, the accuracy values were 82.6% and 92.7%, the best recognition accuracy of all configurations. These results indicate that the temporal, spatial, and motion feature information in RGB data can effectively compensate for the missing information in 3D skeleton point clouds and improve the accuracy of recognition models. To verify the effectiveness of the sampling grouping strategy, five different fusion strategies were compared and tested, with Top-1 accuracy as the indicator. The results are shown in Fig. 10.
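The claim that every added feature helps can be checked directly against Table 2. The sketch below stores each configuration as a set of fused features and verifies that removing any one feature always lowers both Top-1 and Top-5 accuracy. The data structure and function name are illustrative.

```python
# Table 2: fused feature set -> (Top-1 %, Top-5 %). The empty set is the baseline.
table2 = {
    frozenset(): (73.5, 80.5),
    frozenset({"Time"}): (77.6, 85.1),
    frozenset({"Space"}): (76.2, 84.3),
    frozenset({"Motion"}): (78.5, 86.4),
    frozenset({"Time", "Space"}): (79.8, 86.9),
    frozenset({"Time", "Motion"}): (80.5, 88.4),
    frozenset({"Space", "Motion"}): (81.3, 90.6),
    frozenset({"Time", "Space", "Motion"}): (82.6, 92.7),
}


def every_added_feature_helps(table):
    """True if each configuration beats every configuration with one feature removed."""
    for features, (top1, top5) in table.items():
        for dropped in features:
            sub_top1, sub_top5 = table[features - {dropped}]
            if top1 <= sub_top1 or top5 <= sub_top5:
                return False
    return True


monotone = every_added_feature_helps(table2)
full = table2[frozenset({"Time", "Space", "Motion"})]
base = table2[frozenset()]
total_gain = (round(full[0] - base[0], 1), round(full[1] - base[1], 1))
print(monotone, total_gain)
```

For the values in Table 2 the check holds, and the full fusion improves on the baseline by 9.1 and 12.2 percentage points in Top-1 and Top-5 accuracy.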

Figure 10.

Comparison curve of Top-1 accuracy for five different fusion strategies.

From Fig. 10, PointNet++ and I3D on their own achieved accuracies of 83.9% and 84.1%, respectively. With fusion at the input, intermediate, feature, attention, and decision-making levels, the accuracy rates were 84.3%, 87.5%, 90.2%, 90.6%, and 91.2%, respectively. All five fusion strategies improved on the individual models, showing that fusion benefits the accuracy of action recognition. Images from real motion scenes were then selected, and recognition tests were performed with the individual 3D skeleton point cloud model and the fusion model. The test results are shown in Table 3.
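The text does not describe how the decision-making level fusion is computed. As a generic illustration of the idea, and not the authors' implementation, the sketch below averages the class probabilities of a skeleton stream and an RGB stream and then picks the most probable class. The class names, logits, and the equal weighting are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decision_level_fusion(skeleton_logits, rgb_logits, skeleton_weight=0.5):
    """Weighted average of the two streams' class probabilities (late fusion)."""
    p_skeleton = softmax(skeleton_logits)
    p_rgb = softmax(rgb_logits)
    fused = [
        skeleton_weight * a + (1.0 - skeleton_weight) * b
        for a, b in zip(p_skeleton, p_rgb)
    ]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


classes = ["playing football", "running", "shooting"]  # hypothetical labels
skeleton_logits = [2.0, 1.9, 0.1]  # skeleton stream barely prefers class 0
rgb_logits = [0.5, 3.0, 0.2]       # RGB stream clearly prefers class 1

fused, predicted = decision_level_fusion(skeleton_logits, rgb_logits)
print(classes[predicted], [round(p, 3) for p in fused])
```

In this toy case the skeleton stream alone would choose the first class by a narrow margin, while the fused prediction follows the more confident RGB stream. This is the kind of complementary evidence that fusion is intended to exploit.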

Table 3

Recognition results in real scenarios

Actual action	Point cloud model recognition	Fusion model recognition
Playing football on the lawn	Playing football	Playing football on the lawn
Playing table tennis in front of the table	Playing table tennis	Playing table tennis in front of the table
Shooting under the basketball frame	Shoot	Shooting under the basketball frame
Running on the track	Run	Running on the track
Playing badminton in the gym	Playing badminton	Playing badminton in the gym

From Table 3, the single skeleton point cloud model recognized only the basic human action and not the scene in which it took place. The fused model also recognized the scene and the interactive objects, which made the recognition more specific and made similar actions easier to distinguish.

5. Conclusion

To better identify human actions in sports, an MSORIN was constructed based on the Transformer module and multi-modal data. Based on the PointNet++ network, 3D skeleton point cloud and RGB data were integrated to construct an action recognition model. The experimental results showed that the Relation_Transformer model had a clear advantage over the other three models, with Top-1 and Top-5 accuracy rates of 68.8% and 90.2%, respectively. In the tests with a small number of data sets, the model still achieved the highest accuracy, indicating good generalization ability and stability. The multi-modal data fusion strategy also improved recognition accuracy, with higher Top-1 and Top-5 accuracy than single-modal and dual-modal data. The RGB+3D model reached Top-1 and Top-5 accuracy of 81.2% and 91.2%, respectively, which was better than the other algorithms compared. After the 3D skeleton point cloud was fused with all three features, the accuracy reached 82.6% and 92.7%, supporting the effectiveness of the multi-modal fusion strategy in action recognition. Fusion at the input, intermediate, feature, attention, and decision-making levels each improved on the individual models, which further indicates that multi-modal data can make up for the missing information of the 3D skeleton point cloud. Combining the Transformer module with multi-modal data therefore provides new ideas and methods for research in related fields.

However, the research has potential limitations. The action recognition model currently recognizes only a single action, and it may not be able to process actions performed by multiple people at the same time or to distinguish the actions of different individuals in a multi-person scene. The model was trained and tested on particular data sets, so it may overfit to them and generalize poorly to other data sets, data from different sources, or real-world scenarios. The Transformer module and the PointNet++ network require substantial computing resources, which may limit the application of the model in resource-constrained environments. The research does not fully consider the diversity of motion scenes and backgrounds, and scene changes in practical applications may affect the recognition performance of the model. Finally, the complexity and variety of movements in sports may exceed what the training data cover, especially for subtle or unconventional movements.

Footnotes

Conflict of interest

The author declares no conflict of interest.

