Martial arts action recognition based on improved method of human pose estimation using stacked hourglass network

Abstract

In martial arts action recognition, complex poses, occlusions, and dynamic changes can lead to insufficient estimation accuracy, and traditional methods suffer from inaccurate joint localization and poor recognition of continuous pose transitions. In response, this study proposes the Multi-scale Dilated Convolution-Attention-Stacked Hourglass Network pose estimation method (MS-DConv-Att-SHN) for martial arts action recognition. Firstly, multiple dilated convolutions and mixed attention are introduced to reduce the computational complexity and the loss of joint information in stacked hourglass networks, enhancing the ability to capture subtle joint displacements. This is crucial for accurate estimation of complex poses such as movement, flicker, and virtual-real transformations in martial arts. Secondly, in response to the strong coherence of martial arts movements, local feature refinement and channel fusion techniques are used to enhance the correlation analysis between consecutive action frames, solving the problem that traditional methods recognize action chains such as “exertion contraction” in a fragmented way. A martial arts action recognition system based on the improved MS-DConv-Att-SHN method has been developed to better identify individual movements and capture the intrinsic relationships between movements in routines. This provides key technical support for the digital inheritance, intelligent evaluation, and standardized promotion of martial arts, and aligns more closely with the movement characteristics of martial arts that combine form and spirit. The results indicate that structural improvements to the stacked hourglass network effectively increase its percentage of correct keypoints (PCK) and mean average precision (mAP) on both datasets, with PCK and mAP values exceeding 92% and 85%, respectively. The average recognition accuracy of pose keypoints in the MS-DConv-Att-SHN model is superior to that of the comparison models, with a difference of over 1.2%. The improved MS-DConv-Att-SHN model achieves recognition accuracy of over 90% for different martial arts movements, showing smaller parameter counts and PCK values compared to other comparative models. The research method can effectively provide technical support for the automated analysis of martial arts movements, sports training assistance systems, and intelligent martial arts teaching.

Keywords

stacked hourglass network; human posture; martial art; attention module; PCK; mAP; accuracy rate; local features

Introduction

Breakthroughs in computer vision and deep learning have made human pose estimation an important support for intelligent applications in multiple fields. Driven by development needs, it has gradually expanded to multidimensional pose estimation, lightweight deployment, and target tracking, attracting the attention of many scholars.¹ Martial arts, as an important part of traditional Chinese culture, have unique morphological characteristics and temporal and spatial complexity in their movements. Each move and gesture embodies the wisdom of the Chinese nation and has positive significance such as strengthening the body and exercising willpower.² However, traditional martial arts movements involve hand shapes, footwork, moves, and other aspects, with many standards and classifications, and their evaluation relies mainly on manual observation. The evaluation results are therefore subjective, and the overall application efficiency is relatively low. These movements are often taught through mentorship, and differences between schools increase the difficulty of learning for ordinary people and make standardization and large-scale promotion difficult.³ Martial arts movements are complex and diverse, involving many non-standard postures such as soaring and rotating, and some movements have quick transitions, emphasizing the combination of spirit and form.⁴ Existing research has attempted various methods to improve the recognition of martial arts movements. Zhao et al. optimized the background subtraction algorithm, extracting background disparity with symmetric disparity methods and using disparity consistency constraints to eliminate artifacts generated by target movements. The results show that this method can significantly enhance the foreground segmentation effect and eliminate shadows.⁵ Li et al. extracted incorrect martial arts actions based on keyframes and identified them using the optical flow method and the K-means clustering algorithm. The results show that the recognition accuracy of this method exceeds 95%, and the recognition time is less than 12 s.⁶ Husheng et al. used wavelet transform and sliding window technology to segment data, and combined a hidden Markov model with the Viterbi algorithm to achieve martial arts action state recognition. The results show that the recognition accuracy of this method far exceeds 95% and that it has good recognition performance.⁷ However, most of these methods rely on traditional computer vision technology and lack modeling of the spatiotemporal correlations of complex martial arts movements, making it difficult to adapt to challenges such as rapid changes in moves and occlusion. At present, the collection of martial arts action data mostly relies on single-camera shooting, few professional datasets are available, and problems such as high annotation costs and obvious data occlusion remain.⁸

The Stacked Hourglass Network (SHN) is widely used in human pose estimation and keypoint prediction due to its unique recursive refinement mechanism, which can provide a high-precision spatial feature extraction basis for martial arts action recognition. However, it still has significant limitations in processing martial arts movements: (1) Its high computational complexity makes it difficult to meet real-time requirements. (2) The loss of keypoint information leads to significant estimation errors in rapid movements (such as “whirlwind feet”) or occlusion situations (such as “cloud hand” sleeve occlusion). (3) It is difficult to distinguish similar postures, and subtle differences in movements such as “lunges” and “horse steps” can easily be misjudged.⁹ To address two major technical bottlenecks in martial arts action recognition, namely low accuracy in recognizing complex movements together with reliance on professionally annotated data, and insufficient model adaptability to subtle differences between movements, this study proposes a martial arts action recognition method built on an improved human pose estimation method. It introduces multiple dilated convolutions and attention mechanisms and refines local features, with the goal of promoting the intelligent transformation of martial arts teaching, competitive scoring, and other applications. The research aims to improve the SHN and design a lightweight pose estimation model to address key issues such as high-precision spatiotemporal modeling, real-time optimization, and standardization support in martial arts action recognition. The innovation of the research lies in two aspects. Firstly, Multi-scale Dilated Convolution (MS-DConv) and mixed attention are used to achieve a lightweight design of the stacked hourglass, providing expression and correlation of different features. Secondly, considering the difficulty of skeleton data recognition, local feature refinement and channel fusion methods are used to further improve martial arts posture recognition and feature capture and to reduce the interference caused by occlusion. By designing a method for recognizing martial arts movements and postures, the research aims to provide a new technological path for the modernization and inheritance of traditional martial arts.

Literature review

Human pose estimation can achieve martial arts action recognition by detecting the positions and motion trajectories of keypoints in the human body, and its spatiotemporal features can effectively improve the accuracy of action recognition. Xu proposed a method that integrates resolution-aware networks, self-supervised loss, and learning schemes to improve the accuracy of 3D human pose recognition. The results showed that this method could process low-resolution image videos well and enhance the consistency of deep features.¹⁰ Papic et al. used neural networks to analyze sports videos and found that this method could effectively identify and detect body marker positions, reducing motion analysis time.¹¹ Chen proposed a pose recognition algorithm based on random forest and bone feature extraction for the analysis of leg movement recognition in martial arts. The results indicated that this method could effectively classify different leg movements and had good posture recognition performance.¹² The coordination ability of sports is closely related to posture balance. Cherepov et al. used experimental design to analyze the coordination ability of different martial arts athletes and found that coordination ability affected posture balance and sports performance.¹³ Yamei et al. designed a dynamic light acquisition system and proposed combining virtual reality devices for interactive martial arts teaching. The results indicated that this method could significantly improve the effectiveness and quality of martial arts teaching.¹⁴ Echeverria et al. applied pose estimation algorithms to extract predefined action features, and then analyzed the psychomotor performance in martial arts movements. The results indicated that this method was effective and could provide a reference for the analysis of martial arts movements.¹⁵ Pang et al. proposed to improve the accuracy and objectivity of martial arts action scoring and feedback by using feature alignment techniques and adaptive weighted multi-models to extract key features. The results showed that the error values of this method on multiple datasets were smaller than those of the comparison algorithms, and that it could effectively achieve automatic scoring of martial arts.¹⁶ Jia et al. used optical sensors to capture martial arts movements and converted them into visual image trajectories. The results indicated that the system performed well in capturing and simulating image trajectories.¹⁷

Considering that existing video analysis methods struggle to model the subtle differences in martial arts accurately and to scale robustly, Chen developed a deep learning framework that integrates convolutional neural networks and gated recurrent units to achieve visual feature extraction and pose sequence processing. The results showed that the accuracy of martial arts action recognition on the dataset exceeded 90%, and the mean square error of pose estimation was less than 3, indicating good martial arts skill evaluation performance.¹⁸ To address the low popularity of martial arts education, Hui developed an AI-based visualization system for martial arts training movements. The system uses bone feature extraction and a long short-term memory network action classification model, combined with an object-oriented graphics rendering engine, to achieve action rendering. The results indicate that this method can effectively recognize martial arts movements and achieve dynamic visualization, providing a feasible solution for remote martial arts teaching.¹⁹ Lei et al. proposed a kinematic analysis method for Sanda athletes based on feature extraction, which optimizes action feature selection through an adaptive enhancement algorithm and analyzes the time characteristics of different action stages. The results indicate that this method can accurately extract technical action features and provide data support for athlete tactical evaluation.²⁰ Liu et al. utilized accelerometers and machine learning to recognize kicking movements in taekwondo, and analyzed waist and ankle sensor data using support vector machines and decision trees. They found that combining only waist data with a support vector machine model can achieve a classification accuracy of 96%, which reduces the equipment dependence of taekwondo movement monitoring.²¹ Cheng et al. constructed a motion training management system based on convolutional neural networks, achieving an accuracy rate of over 90% in action recognition and taking only half the time of traditional methods.²² Shang et al. utilized neural networks to enhance martial arts action recognition and assist in training by tracking joint angles. The results showed that the action recognition accuracy of this method approached 95% and that it had good auxiliary training effects.²³

Existing research on martial arts action recognition mainly focuses on multi-modal fusion, neural networks, feature optimization, sensors, and other aspects to achieve human pose estimation. Its common thread is improving feature extraction or temporal modeling to enhance recognition performance. Although the above methods can improve the accuracy of pose estimation and action segmentation classification to a certain extent, they still have the following shortcomings. Firstly, their sensitivity to subtle pose differences in martial arts (such as joint rotation angles) is insufficient. Secondly, it is difficult for them to balance real-time performance and robustness, and they are susceptible to background interference or rapid actions, leading to error accumulation. Based on this, the study proposes the implementation of martial arts action recognition using the MS-DConv-Att-SHN method. Specifically, the method of Xu, which integrates resolution-aware networks with self-supervised loss, significantly improves the accuracy of 3D pose estimation in low-resolution videos. However, it has not been optimized for the fast movements and occlusion problems common in martial arts, and the robustness of the model under occlusion is slightly insufficient. Regarding the analysis of martial arts movements, the random forest method proposed by Chen relies heavily on static skeletal data and has difficulty handling rapid variations in moves. Similarly, the integrated framework proposed by Chen exhibits high computational complexity. In contrast, the research method uses multiple dilated convolutions and mixed attention to achieve better spatiotemporal feature fusion with lower computational load, enhancing the ability to capture subtle postures in martial arts. Moreover, existing research has paid insufficient attention to the issue of occlusion in martial arts. For example, although Hui’s AI visualization system can render 3D actions, it has not solved the problem of missing keypoints under partial occlusion. In addition, although Liu’s accelerometer scheme is effective for classifying taekwondo kicks in free environments, it relies on hardware deployment. Pang’s feature alignment technique did not consider the impact of action coherence on scoring. The real-time performance of Yamei’s VR teaching system is insufficient. The research method refines the local features of skeleton data to avoid difficulties in distinguishing similar poses, and channel fusion effectively captures the correlation between the features of different poses and actions, balancing real-time performance and accuracy.

Martial arts action recognition based on improved SHN human pose estimation method

Design of improved method for SHN human pose estimation

The cascaded network structure of SHN can extract feature maps of different scales through repeated downsampling (encoding) and upsampling (decoding), and its top-down, layered design achieves multi-scale information fusion, producing feature maps that are the same size as the input image.²⁴ In SHN, the feature map obtained from the input image after preliminary feature extraction and convolution processing is used as the input for the next hourglass network. The network learns features through loss comparison and gradient descent, and its multi-stage intermediate supervision can effectively achieve feature prediction, thereby improving prediction accuracy.²⁵,²⁶ To reduce the computational complexity and the number of network parameters, the study uses skip connections to improve the training speed of the encoders and decoders. This is achieved by reconstructing higher-level features in a stacked form, which improves network performance while keeping the network size essentially unchanged. Figure 1 shows the lightweight hourglass module structure improved by multiple convolutions.

Figure 1.

Lightweight hourglass module structure improved by multiple convolutions.

In Figure 1, the hourglass module is composed of residual blocks improved by multiple convolutions and includes encoder and decoder architectures. The residual blocks in the purple dashed box are saved and fused with multi-layer encoder features, and the stacked feature map preserves the hierarchical information of the image without changing the size of the input image. Convolution processing can generate heat maps with different feature probabilities along the edges, thereby improving recognition accuracy. In the residual module constructed with dilated convolution, the middle layer of the pre-activated residual block is replaced with a depthwise separable convolution, which reduces the size of the residual network and expands the receptive field. Considering that SHN is inevitably affected by noise and occlusion when collecting features of different resolutions, a Local Attention (LA) module containing position and channel attention is added to filter irrelevant information. The position attention in the LA module weights the channel features to better correlate spatial contextual information, while channel attention focuses on distinguishing different feature channels.²⁷ Meanwhile, to integrate the global information of the image, the study also passes the attention features generated by the attention module into the next layer, which forms the global attention (GA) framework. Figure 2 shows the framework of the LA and GA modules.

Figure 2.

Framework of LA and GA modules.

The LA module in Figure 2 uses positional attention and channel attention to reintegrate the features of the hourglass network, and fuses the obtained attention features with the original features to generate features of different scales. GA applies lightweight processing to the input image, and then maps and combines the feature maps input by each unit. The fused result is used as the input for the next hourglass unit, and the feature maps under the stack structure capture image information at different levels. Equation (1) is the mathematical expression of attention mapping.²⁸

\begin{cases} \phi(\lambda) = \dfrac{e^{S(\lambda)}}{\sum_{\lambda' \in Q} e^{S(\lambda')}} \\ h_{r}^{patt} = f_{r} * \sum_{r} \phi_{r}^{p} \\ h_{r}^{att} = \sum_{c} \left( h_{r}^{patt} * \phi_{r}^{c} \right) \end{cases}

(1)

In equation (1), $f_{r}$ is the feature mapping, $Q$ is the query matrix, $\phi_{r}^{p}$ is the LA mapping generated by positional attention, $h_{r}^{patt}$ is the positional attention feature, $\phi_{r}^{c}$ and $S(\lambda)$ are the channel attention mapping and feature mapping, $h_{r}^{att}$ is the channel attention feature, $\phi$ is the attention mapping, and $*$ is the Hadamard product. $c$ indexes the weighted features, and $e^{S(\lambda)}$ is the unnormalized attention weight. The final global output feature map can be represented as equation (2).

\begin{cases} g = f_{n} + h_{n}^{att} \\ h_{i}^{att} = f_{i} * \phi_{i} \end{cases}

(2)

In equation (2), $f_{n}$ represents the output feature map of the unit, $\phi_{i}$ represents the attention mapping, $h_{n}^{att}$ represents the attention feature, and $g$ represents the final global output feature map. $f_{i}$ is the feature map generated by the $i$-th hourglass network. Based on the above content, the Multi-scale Dilated Convolution-Attention-Stacked Hourglass Network pose estimation method (MS-DConv-Att-SHN) can be designed. The network structure diagram is shown in Figure 3.

Figure 3.

MS-DConv-Att-SHN.

In Figure 3, the MS-DConv-Att-SHN mainly includes lightweight modules for multi-convolution processing, LA modules, and GA modules. After multiple convolution processing of the input image, feature maps at different resolutions can be obtained. The feature maps are processed by the attention module to obtain attention features. After applying regression loss processing to the multi-layer attention features obtained from stack processing, the final action posture recognition result can be output.
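The attention weighting and residual fusion in equations (1) and (2) can be illustrated with a minimal sketch in plain Python. It assumes that the attention map is a softmax over raw position scores and that feature maps are stored as lists of flattened channels; the function names and the toy values are hypothetical and are not taken from the study.

```python
import math

def softmax(scores):
    """First line of equation (1): normalize raw scores S(lambda) into phi."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_feature(f, phi):
    """Second line of equation (2): h_att = f * phi (Hadamard product).

    f   : list of channels, each a flat list of spatial values
    phi : one attention weight per spatial position (assumed shared by channels)
    """
    return [[v * w for v, w in zip(channel, phi)] for channel in f]

def global_output(f_n, h_att_n):
    """First line of equation (2): g = f_n + h_att_n (residual fusion)."""
    return [[a + b for a, b in zip(fc, hc)] for fc, hc in zip(f_n, h_att_n)]

# Toy example: 2 channels, 2 spatial positions, equal raw scores.
f = [[1.0, 2.0], [3.0, 4.0]]
phi = softmax([0.0, 0.0])        # uniform attention map
h_att = attention_feature(f, phi)
g = global_output(f, h_att)
```

Because the attention feature is added back to the unit output, a uniform attention map only rescales the features, while a peaked map emphasizes the positions with the highest scores.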

Martial arts action recognition based on improved method of human body posture estimation

To better identify similar martial arts movements, it is necessary to consider not only their global posture but also their local postures, in order to better distinguish the characteristics of different types of movements. Therefore, based on the above research, a design that refines and fuses local features is proposed to improve the performance of the MS-DConv-Att-SHN network. Human action recognition mainly includes four aspects: image acquisition, skeleton data acquisition, feature extraction, and recognition. Recognizing actions effectively from skeleton data is a central problem.²⁹ Considering the correlation between local human body parts, a local spatiotemporal feature extraction module is added to the MS-DConv-Att-SHN network to further subdivide and analyze the fused information feature module. Figure 4 is a schematic diagram of the improved MS-DConv-Att-SHN structure.

Figure 4.

Improved MS-DConv-Att-SHN.

In Figure 4, the local spatiotemporal feature extraction module achieves local feature refinement through a dual-branch architecture (spatial branch + motion branch). Adjacent frames of the video clip are fed into the improved human pose estimation model, which generates a human pose estimation map (including keypoint coordinates and confidence). In the spatial branch, spatial features are extracted from the pose map using a 7 × 7 convolutional layer, preserving the topological relationships between joint points. In the motion branch, the differential calculation captures small displacements, while the channel weighting mechanism suppresses ineffective motion noise in occluded areas. Feature refinement addresses the core problems of local information loss and occlusion interference in martial arts movements. To reduce computational cost, the pixel differences of the video frames are calculated and stacked along the channel dimension to obtain the motion differential feature $X_{D}$, as shown in equation (3).³⁰

X_{D} = \mathrm{CNN}(\mathrm{AvgPool}(D(I_{i}))), \quad X_{D} \in \mathbb{R}^{N \times 64 \times H/4 \times W/4}

(3)

In equation (3), $N$ is the batch size, $H$ and $W$ are the height and width of the feature map, 64 is the channel dimension, $\mathrm{CNN}$ is a convolutional operation, $\mathrm{AvgPool}$ is the average pooling operation, and $D(I_{i})$ is the pixel difference computed from the input frames $I_{i}$. After a global average pooling operation is performed on the differential features, global information is obtained, which is then processed through convolution and activation functions to obtain the dependency relationship of channel features.³¹ The local motion features are related to the differential features and their weights, and the output $X_{L}$ of local spatiotemporal feature extraction is shown in equation (4).

X_{L} = X_{s} + \mathrm{Upsample}(\bar{X}_{D}), \quad X_{L} \in \mathbb{R}^{N \times 64 \times H/4 \times W/4}

(4)
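The data flow of equations (3) and (4) can be sketched on a single-channel toy example in plain Python. This is only an illustration under simplifying assumptions: the learned CNN and the channel weighting are omitted (treated as identity), pooling and upsampling use a factor of 2, and all function names are hypothetical.

```python
def frame_difference(prev, curr):
    """D(I_i): pixel-wise difference between two adjacent frames."""
    return [[c - p for p, c in zip(pr, cr)] for pr, cr in zip(prev, curr)]

def avg_pool(x, k):
    """Non-overlapping k x k average pooling on a 2D list."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w - k + 1, k)]
            for i in range(0, h - k + 1, k)]

def upsample_nearest(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return [[v for v in row for _ in range(k)] for row in x for _ in range(k)]

def local_feature(x_s, prev, curr, k=2):
    """Equation (4) with the CNN and channel weights left out (assumption)."""
    x_d = avg_pool(frame_difference(prev, curr), k)   # equation (3), no CNN
    motion = upsample_nearest(x_d, k)                 # back to the size of x_s
    return [[s + m for s, m in zip(sr, mr)] for sr, mr in zip(x_s, motion)]

# Toy frames: only the top-left 2 x 2 block changes between frames.
prev = [[0.0] * 4 for _ in range(4)]
curr = [[4.0, 4.0, 0.0, 0.0],
        [4.0, 4.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
x_s = [[1.0] * 4 for _ in range(4)]
x_l = local_feature(x_s, prev, curr)
```

In this toy case only the region that moved receives an additive motion term, which mirrors how the motion branch highlights displaced joints while leaving static regions unchanged.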

In equation (4), $X_{s}$ represents spatial features, $\mathrm{Upsample}$ represents upsampling, and $\bar{X}_{D}$ represents local motion features. In the local feature refinement section, the study splits the predefined skeleton into upper-limb and lower-limb subgraphs and dynamically corrects the features through error set and fuzzy set calibration. Misclassified features are gathered from the training samples and their mean is taken as the error prototype, and the propagation of erroneous features is suppressed through auxiliary loss terms during backward optimization. The features of low-confidence keypoints (such as wrists obscured by sleeves) are clustered, and their global representations are updated to reduce noise interference. Adjacent frame difference calculation can highlight continuous motion areas (such as kicking trajectories) and suppress static obstructions (such as fluttering martial arts clothing). The established feature space anchor points serve as a benchmark for inferring occluded parts. The dynamic correction of fuzzy sets can eliminate features that deviate from the natural kinematic constraints of limbs (such as 180° abnormal bending of forearms), and the adversarial learning of error sets can improve the robustness of the model by analyzing the distribution differences between occluded features and normal features. The study further classifies local features and refines their action categories by calculating the center representations of the error set and the fuzzy set on the predicted features. The center representation can be expressed as equation (5).

\begin{cases} u_{FP}^{k} = \dfrac{\sum_{j \in S_{FP}^{k}} F_{j}}{n_{FP}^{k}} \\ u_{NP}^{k} = \dfrac{\sum_{j \in S_{NP}^{k}} F_{j}}{n_{NP}^{k}} \end{cases}

(5)

In equation (5), $u_{FP}^{k}$ and $u_{NP}^{k}$ are the central features of the erroneous prototype and the fuzzy prototype, $n_{FP}^{k}$ and $n_{NP}^{k}$ are the corresponding total numbers of samples, $k$ is the action category, and $F_{j}$ is the feature extracted from the sample space. Afterward, the classified dataset is calibrated by introducing auxiliary terms from the feature space that calculate the cosine similarity between samples, in order to achieve sample feature correction.³² Clustering ensures that the action features of the set are updated to achieve a global action representation, whose mathematical expression is shown in equation (6).

P_{k} = (1 - \alpha) \cdot u_{FP}^{k} + \alpha \cdot P_{k}

(6)
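The prototype calculation in equation (5), the cosine similarity used for calibration, and the momentum update in equation (6) can be sketched as follows. Feature vectors are plain Python lists, the function names are hypothetical, and the toy values and the momentum of 0.9 are assumptions chosen only for illustration.

```python
import math

def mean_feature(features):
    """Equation (5): centre of a set of feature vectors F_j."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def momentum_update(p_k, u_k, alpha):
    """Equation (6): new P_k = (1 - alpha) * u_k + alpha * old P_k."""
    return [(1.0 - alpha) * u + alpha * p for p, u in zip(p_k, u_k)]

# Toy error set for one action category k (values are made up).
error_set = [[1.0, 0.0], [3.0, 0.0]]
u_fp = mean_feature(error_set)                     # error prototype
p_k = momentum_update([0.0, 0.0], u_fp, alpha=0.9)
sim = cosine_similarity(u_fp, [2.0, 0.0])
```

A momentum close to 1 keeps the stored representation stable across batches, while a smaller value lets it follow the most recent prototype more closely.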

In equation (6), $\alpha$ is the momentum parameter and $P_{k}$ is the feature set maintained for action $k$. Afterward, the study proposes a fusion module that combines the Red-Green-Blue (RGB) color model with frequency domain information, allowing the model to pay more attention to abnormal features and subtle motion changes. The structure is shown in Figure 5.

Figure 5.

Feature information fusion module.

In Figure 5, the frequency domain information and color channel information features are connected together and subjected to batch normalization and nonlinear convolution processing to obtain fused information. The mathematical expression of fused feature $V$ is shown in equation (7).

V = \delta(\beta(\mathrm{Conv}_{1 \times 1}(U)))

(7)
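Equation (7) chains channel concatenation, a 1 × 1 convolution, batch normalization, and ReLU. A minimal plain-Python sketch of that chain is given below. It assumes a single sample with flattened pixels, batch normalization without learned scale and shift, and hand-picked convolution weights; all names and values are hypothetical.

```python
import math

def conv1x1(u, weights):
    """1 x 1 convolution: a per-pixel linear map across channels.

    u       : [C_in][P]    channels x flattened pixels
    weights : [C_out][C_in]
    """
    n_pix = len(u[0])
    return [[sum(w[c] * u[c][p] for c in range(len(u))) for p in range(n_pix)]
            for w in weights]

def batch_norm(x, eps=1e-5):
    """Normalize each channel to zero mean and unit variance (no affine terms)."""
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        out.append([(v - mean) / math.sqrt(var + eps) for v in channel])
    return out

def relu(x):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in channel] for channel in x]

def fuse(rgb, freq, weights):
    """V = ReLU(BN(Conv1x1(U))), with U the channel-wise concatenation."""
    u = rgb + freq  # list concatenation stacks the channels
    return relu(batch_norm(conv1x1(u, weights)))

# One RGB-derived channel and one frequency-domain channel, two pixels each.
v = fuse(rgb=[[1.0, 3.0]], freq=[[1.0, 1.0]], weights=[[1.0, 1.0]])
```

The 1 × 1 convolution mixes the two information sources at each pixel without enlarging the spatial footprint, which keeps the fusion step inexpensive.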

In equation (7), $\delta$ is the ReLU function, $U$ is the connection feature, and $\beta$ is the batch normalization operation. Afterward, attention maps are used to highlight the regions of interest in the features, and the dot product between the feature map and the attention map is used to represent important regions, thereby capturing the correlation between local regions. In the lightweight design of multiple dilated convolutions, the study constructs multiple dilated convolution residual modules to replace traditional residual blocks. This involves using a pyramid of dilation rates to achieve multi-scale receptive field coverage in a 3 × 3 depthwise separable convolution, and using the depthwise separable structure to reduce the number of parameters in each residual module while maintaining feature dimension consistency. In the integration of attention mechanisms, the study integrates the collaborative local-global attention architecture into each hourglass unit. The position attention of the local attention module calculates spatial correlation through the covariance matrix between channels, enhances the mutual localization of adjacent joint points, and generates an attention heatmap to focus on key motion regions. The channel attention adaptively weights the feature channels. The global attention module transfers attention features between stacked hourglass layers and achieves progressive repair of occluded joints through cross-layer attention mapping. Considering the multi-scale characteristics of martial arts movements, this study uses multi-scale convolution to enhance the differentiation of features at all levels. High-resolution features focus on local details, while low-resolution features enhance the global posture.
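The parameter saving from depthwise separable convolution and the receptive field gained from dilation can be checked with simple arithmetic. The sketch below uses the standard counting formulas with bias terms ignored; the channel width of 256 and the dilation rates (1, 2, 3) are assumptions for illustration, since the study does not state them here.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """Depthwise k x k convolution plus pointwise 1 x 1 convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out

def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

# Assumed configuration: 256 input and output channels, dilation rates 1, 2, 3.
std = standard_conv_params(256, 256)
dws = depthwise_separable_params(256, 256)
extents = [effective_kernel(3, d) for d in (1, 2, 3)]
print(std, dws, round(std / dws, 2), extents)
```

Under these assumptions the separable layer needs roughly one ninth of the weights of a standard 3 × 3 layer, and the three dilation rates cover 3 × 3, 5 × 5, and 7 × 7 neighbourhoods without adding any parameters.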

Design of martial arts action recognition system

The estimation and recognition of human posture in martial arts movements consists of two parts: posture estimation and action recognition. The research designs a system from the aspects of data collection, posture estimation, action recognition, and data management. In the data collection section, the study utilized high frame rate RGB cameras (≥60 fps) and depth sensors (Kinect) for real-time video data collection. The collection environment was a standardized martial arts training ground, and lighting conditions were controlled. The constructed martial arts action set included different action categories. After inputting human skeleton data, the improved MS-DConv-Att-SHN network method was used to achieve human action classification and recognition. The system can display and manage the skeleton action judgment results, so as to view and analyze the human pose estimation and action recognition results at any time. Figure 6 is a schematic diagram of the overall system architecture design.

Figure 6.

Schematic diagram of overall system architecture design.

In Figure 6, the system includes an acquisition layer, a processing layer, and an application layer. The data acquisition layer mainly collects videos and stores them for backup, while the data processing layer implements preprocessing of video frames and extraction of pose and action features and types. The data application layer provides interactive functions such as result display and data management. Figure 7 shows the system implementation content.

Figure 7.

Implementation content of martial arts action recognition system.

In Figure 7, the improved MS-DConv-Att-SHN network can achieve human pose estimation, where each hourglass module includes downsampling and upsampling paths, skip connections preserve spatial information, and attention mechanisms achieve multi-scale feature fusion. The refinement and improvement of local features can recognize actions and finally display the recognized action categories. The spatiotemporal feature database stores daily training data and automatically generates training reports based on attention weighted keyframe extraction technology. The system platform can quantify progress indicators by comparing historical data, optimize the lightweighting degree of models for different scenarios, and establish a martial arts action standard parameter library as a benchmark for evaluation.

This research created a self-made dataset of martial arts movements and selected the movements from “Chinese Fist” for recognition, including horse step frame fighting, withdrawal step fork palm, parallel step punching, bow step punching, and other movements. Video data were recorded with binocular cameras and the different martial arts movements were annotated separately, so that the resulting videos could be segmented into frames. For data collection, the video stream receiving interface and the action recognition interface each support two modes: single action and continuous action.³³ In system action evaluation, the research selects indicators such as the percentage of correct keypoints (PCK) for pose evaluation, and evaluates action recognition using indicators such as classification accuracy, confusion matrix, and mean average precision (mAP) for temporal action detection.³⁴,³⁵ Equation (8) is the mathematical expression for PCK and the percentage of correct parts (PCP).

\begin{cases} PCK_{i}^{k} = \dfrac{\sum_{p} κ \left( d_{p i} / d_{p}^{def} \leq T_{k} \right)}{\sum_{p} 1} \\ PCP = m / n \end{cases}

(8)

In equation (8), κ is the judgment condition, which equals 1 when the inequality holds and 0 otherwise, i is the joint point, T_{k} is the threshold, p is the human body, d_{p i} is the Euclidean distance between the predicted keypoint and the manually annotated keypoint, d_{p}^{def} is the scale factor, m is the number of correctly detected joint points, and n is the total number of joint points. The Object Keypoint Similarity (OKS) metric can be used to evaluate overall pose estimation quality; it reflects the difficulty of detecting different keypoints through the similarity between predicted and ground-truth keypoints. Equation (9) is the mathematical expression of OKS.³⁶

OKS = \frac{\sum_{i} \exp \left( - \frac{d_{i}^{2}}{2 s^{2} k^{2}} \right) γ (v_{i} > 0)}{\sum_{i} γ (v_{i} > 0)}

(9)

In equation (9), d_{i} is the Euclidean distance between the predicted and true positions of keypoint i, k is the keypoint constant, v_{i} is the visibility of the keypoint, γ is the indicator function that selects visible keypoints, and s^{2} is the area occupied by the detected human body in the image.
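Equations (8) and (9) can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the study: the function names, the input layout (one `(x, y)` tuple per keypoint), and the use of a single constant `k` for all keypoints are assumptions made for the example.

```python
import math


def pck(pred, gt, scale, threshold=0.5):
    """PCK for one joint over all people, following equation (8).

    pred, gt: lists of (x, y) for the same joint, one entry per person.
    scale: per-person scale factors d_p^def (assumed here, e.g. a body-size measure).
    """
    hits = 0
    for (px, py), (gx, gy), s in zip(pred, gt, scale):
        d = math.hypot(px - gx, py - gy)   # Euclidean error d_pi
        if d / s <= threshold:             # judgment condition against T_k
            hits += 1
    return hits / len(pred)


def pcp(correct, total):
    """PCP = m / n: correctly detected joint points over all joint points."""
    return correct / total


def oks(pred, gt, visibility, area, k):
    """OKS for one person, following equation (9) with a single constant k.

    area is s^2, the image area occupied by the person.
    """
    num, den = 0.0, 0
    for (px, py), (gx, gy), v in zip(pred, gt, visibility):
        if v > 0:                          # only visible keypoints contribute
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2.0 * area * k ** 2))
            den += 1
    return num / den if den else 0.0
```

For instance, with two people whose scale factor is 10 pixels, a perfect prediction and a prediction that is 10 pixels off give a PCK@0.5 of 0.5, while a perfect prediction of all visible keypoints gives an OKS of 1.0.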

Analysis of martial arts action recognition results under the improved estimation method

Experimental environment setting and dataset source

To better recognize martial arts movements, an experimental environment was set up with the Windows 10 64-bit operating system, 32 GB of DDR4 2400 MHz memory, and a 12th Gen Intel(R) Core(TM) i5-12400F CPU. The system was developed with the PyTorch 3.8 deep learning framework, the Python 3.10 programming language, a FastAPI backend service, and a Vue.js frontend interface. The software platform was OpenCV, with a video file resolution of 480p and a frame rate of 30 fps. The parameter optimizer was Adam, with an initial learning rate of 0.001, 200 iterations, and a batch size of 16. The study used the publicly available MPII and NTU RGB + D 60 datasets, as well as a self-made martial arts action dataset, for training and testing. The MPII dataset contains over 20000 human pose estimation images and over 40000 body joint annotations, providing rich human pose annotation information. The NTU RGB + D 60 dataset includes 56880 action samples in 60 action classes, including martial-arts-related actions such as kicking and punching. Its RGB video is 1920 × 1080 (30 FPS), and its skeletal keypoints were captured using Kinect v2. The research randomly divided the public datasets into a test set and a training set in a 6:4 ratio, and set a threshold of 0.9 based on the similarity calculated from posture. The self-built martial arts action dataset was collected with a depth camera. A total of 300 videos were recorded and randomly assigned to the training and validation sets in a 5:1 ratio. The main actions (such as fist hugging, cross palm, split palm, bright palm, counter fist, grid fist, punch, frame punch, and piercing palm) were annotated. The time series output length and keypoint count of the dataset were set to 100 and 46, respectively. PCK, PCP, accuracy, and the confusion matrix were used as indicators.
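The 5:1 random split of the 300 self-recorded videos can be sketched as follows. The function name, the integer video identifiers, and the fixed seed are hypothetical choices for the illustration; the study does not describe how the split was implemented.

```python
import random


def split_videos(video_ids, train_parts=5, val_parts=1, seed=0):
    """Randomly split video identifiers into training and validation sets."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)   # seeded so the split is reproducible
    n_train = len(ids) * train_parts // (train_parts + val_parts)
    return ids[:n_train], ids[n_train:]


# 300 videos at 5:1 give 250 training and 50 validation videos.
train_ids, val_ids = split_videos(range(300))
```

Splitting at the video level, as here, keeps all frames of one recording in the same subset, which avoids near-duplicate frames appearing in both training and validation data.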

Effectiveness of SHN for human pose estimation method

To evaluate how the residual structure affects human pose estimation, the study replaced the residual blocks of the SHN and compared the resulting networks. The compared residual structures were the bottleneck residual block (Bottleneck Res), depthwise separable convolution (DS Conv), MS-DConv, and the standard error residual (SE residual). The comparison used the parameter quantity and the PCK value at a threshold of 50% (PCK@0.5). The comparison results are shown in Figure 8.

Figure 8.

Performance testing of improved structure of SHN.

In Figure 8(a), although replacing the residual block with depthwise separable convolution reduced the number of network parameters, its PCK value decreased significantly, to less than 78%. Adding multiple dilated convolutions reduced the number of parameters from 3.68 M to 2.12 M while leaving the PCK performance largely unaffected; the PCK@0.5 value and inference speed reached 84.15% and 38 FPS, respectively. The bottleneck residual had fewer parameters (1.25 M) than the standard error residual (3.74 M), but their PCK values did not exceed 80%, and the inference speed was slower. In Figure 8(b), the detection accuracy with the expansion scale (2, 3) was higher than with (3, 5) under the same number of stacks. The PCK values for stacks 1, 2, and 3 were 78.26%, 79.84%, and 81.25%, respectively. The study therefore set the expansion scale of the improved SHN to (2, 3). Afterward, the detection performance of the research methods was compared using ablation experiments, and the results are shown in Table 1.
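To make the two expansion scales concrete, the sketch below computes the effective kernel size of a dilated convolution and the receptive field of two stride-1 dilated layers. It assumes 3 × 3 kernels and that the two dilated convolutions are applied in sequence; the study does not state either detail, so the numbers are illustrative only.

```python
def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def stacked_receptive_field(dilations, kernel=3):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    rf = 1
    for d in dilations:
        rf += effective_kernel_size(kernel, d) - 1
    return rf


def conv_weight_count(in_ch, out_ch, kernel=3):
    """Weights of a 2D convolution (bias excluded); independent of dilation."""
    return in_ch * out_ch * kernel * kernel


rf_small = stacked_receptive_field((2, 3))   # 11
rf_large = stacked_receptive_field((3, 5))   # 17
```

Under these assumptions, the (3, 5) setting covers a wider context than (2, 3) with the same number of weights, but it samples the input more sparsely.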

Table 1.

Ablation results under human pose estimation.

Structure	Parameter quantity (M), MPII	Parameter quantity (M), NTU RGB + D 60	PCK (%), MPII	PCK (%), NTU RGB + D 60	mAP (%), MPII	mAP (%), NTU RGB + D 60	FPS, MPII	FPS, NTU RGB + D 60
SHN	7.95	7.01	77.08	77.59	82.52	83.88	45	47
SHN + MS-DConv	5.02	4.21	79.96	81.52	83.31	84.67	52	54
SHN + LA	7.84	4.11	78.73	80.35	83.14	84.52	58	60
SHN + GA	7.83	6.89	78.28	79.17	83.25	84.61	41	45
SHN + MS-DConv + LA	5.02	6.52	88.83	90.59	84.39	85.75	43	47
SHN + MS-DConv + GA	7.91	6.48	84.61	85.93	84.42	85.78	38	40
SHN + LA + GA	5.04	6.77	88.17	89.14	82.16	83.52	45	48
MS-DConv-Att-SHN	4.69	4.21	92.49	93.84	87.77	89.13	52	56

In Table 1, adding different modules to the SHN improved detection accuracy on both datasets to varying degrees. Specifically, after the MS-DConv module was introduced into the SHN model, PCK rose by more than 2.5 percentage points on both datasets, and the number of model parameters decreased. With the LA or GA module alone, the PCK of the SHN model did not exceed 81% and its mAP did not exceed 85% on either dataset, so the gain was limited, although detection still improved over the baseline. Combining the modules improved the detection performance of the SHN model much further. The PCK values of the MS-DConv-Att-SHN model on the MPII dataset and NTU RGB + D 60 dataset were 92.49% and 93.84%, respectively, and the mAP values were 87.77% and 89.13%, respectively. The number of parameters was significantly reduced, and the running speed also increased. The multiple dilated convolution (MS-DConv) significantly reduced the number of model parameters (on MPII, from 7.95 M to 5.02 M) while improving PCK by 2.88–3.93 percentage points, indicating that its multi-scale receptive field can effectively capture the spatiotemporal characteristics of martial arts movements. The gain was larger on the NTU dataset, indicating stronger modeling ability for complex interactive actions. The Local Attention (LA) module raised inference speed to 58 FPS (MPII) while maintaining parameter efficiency, demonstrating the efficiency of its feature filtering mechanism. Used alone, Global Attention (GA) gave the smallest improvement (PCK increased by only 1.20–1.58 percentage points) and reduced FPS, but it complemented the other modules in the complete model. The combination of MS-DConv + LA significantly improved the PCK value of the SHN model. The complete model formed by fusing these three components not only reduced the number of parameters but also verified the collaborative optimization effect of each component in feature extraction, local enhancement, and global integration. The improved SHN-based human pose estimation method proposed in the study was then analyzed for local martial arts action recognition. The comparison indicators were accuracy and complexity, and the results are shown in Table 2.
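The PCK gains quoted for Table 1 can be recomputed directly from the table rows. The short sketch below does so for the MS-DConv variant and the complete model; the dictionary and function names are illustrative, and only values copied from Table 1 are used.

```python
# PCK (%) from Table 1 as (MPII, NTU RGB + D 60)
TABLE1_PCK = {
    "SHN": (77.08, 77.59),
    "SHN + MS-DConv": (79.96, 81.52),
    "MS-DConv-Att-SHN": (92.49, 93.84),
}


def pck_gain(name, baseline="SHN"):
    """PCK gain over the baseline, in percentage points, per dataset."""
    base = TABLE1_PCK[baseline]
    row = TABLE1_PCK[name]
    return tuple(round(r - b, 2) for r, b in zip(row, base))


msdconv_gain = pck_gain("SHN + MS-DConv")    # (2.88, 3.93)
full_gain = pck_gain("MS-DConv-Att-SHN")     # (15.41, 16.25)
```

The gains are differences between two percentages, so they are percentage points rather than relative improvements.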

Table 2.

Ablation results under human posture recognition.

Structure	ACC (%), MPII	ACC (%), NTU RGB + D 60	FLOPs (G), MPII	FLOPs (G), NTU RGB + D 60
MS-DConv-Att-SHN	86.06	82.52	39	42
MS-DConv-Att-SHN + local feature refinement	87.54	85.42	45	49
MS-DConv-Att-SHN + channel fusion	88.71	87.29	68	67
Improved MS-DConv-Att-SHN	89.74	88.56	70	71

In Table 2, the recognition accuracy of the MS-DConv-Att-SHN model improved after local feature refinement and channel fusion were incorporated, at the cost of higher computational complexity. Its recognition accuracy on the two datasets was 89.74% and 88.56%, respectively, and the complexity values were 70 G and 71 G. This indicated that the model had good classification performance at an acceptable computational cost. Afterward, the improved MS-DConv-Att-SHN algorithm proposed in the study was compared with the resolution-aware network fusion method (RNF), the Random Forest Pose Recognition Algorithm (RF-PR), the Spatial Temporal Graph Convolutional Network (ST-GCN), and PoseC3D for keypoint recognition performance. The results are shown in Figure 9.

Figure 9.

Key point recognition effect.

In Figure 9(a), the human pose keypoint recognition of the RF-PR model was poor, with an average accuracy of no more than 85%. Meanwhile, the average recognition accuracies of the ST-GCN, RNF, and PoseC3D models were 3.24%, 2.16%, and 1.37% lower than that of the research model, respectively. The keypoint recognition performance of the fusion model proposed in the study was significantly better than that of the other models. In Figure 9(b), the recognition performance of the comparative models decreased, and they were more strongly affected by optical interference in the dataset. The research model still showed good keypoint recognition performance, with an average value exceeding 84%.

Martial arts action posture estimation and action recognition results

The proposed martial arts action recognition method was tested on the self-made martial arts dataset, and the results are shown in Figure 10.

Figure 10.

Recognition results of different continuous martial arts movements.

In Figure 10, the recognition accuracy of the research model for different martial arts movements exceeded 90%, while its recognition error did not exceed 0.012%, indicating high reliability and the ability to meet the needs of professional martial arts training and competition scoring. For example, in an automated scoring system for martial arts routines, the model can identify subtle differences in movements such as “bright palm,” reducing subjective bias in manual evaluation. In addition, the low error rate makes it suitable for high-precision movement correction, such as posture adjustment in Tai Chi teaching, so that incorrect movements do not become muscle memory. This performance provides reliable technical support for the standardized promotion of martial arts. Afterward, the proposed martial arts action recognition method was compared with the Convolutional Neural Network Gated Recurrent Unit Integrated Model (CNN-GRU), the Feature Alignment with Adaptive Weighting Model (FA-AWM), the Hidden Markov Model with Viterbi Decoding (HMM Viterbi), the Multi-Modal Spatio Temporal Network (MMSTNet), and the Ranking Graph Convolutional Network (Rank GCN). The results are shown in Table 3.

Table 3.

Recognition results of martial arts actions using different algorithms.

Model	Parameter quantity (M)	PCK (%)	PCP	mAP (%)	FPS	ACC (%)	FLOPs (G)
CNN-GRU	8.21	82.33	0.78	75.65	45	88.22	32
FA-AWM	5.70	85.15	0.82	78.33	42	87.92	39
HMM Viterbi	12.1	86.80	0.71	70.21	58	89.42	33
MMSTNet	8.42	85.61	0.65	81.92	41	91.17	37
Rank GCN	6.30	86.22	0.80	87.54	43	94.23	43
Improved MS-DConv-Att-SHN	3.73	92.76	0.92	95.36	57	96.31	66

In Table 3, HMM Viterbi and the research model had the highest and lowest parameter counts, at 12.1 M and 3.73 M, respectively. The research model performed best on the PCK and Percentage of Correct Parts (PCP) indicators, with values of 92.76% and 0.92, respectively. The next best PCK values came from the HMM Viterbi model and the Rank GCN model, at 86.80% and 86.22%, respectively. In terms of efficiency, HMM Viterbi had the best real-time performance (58 FPS), but its recognition accuracy did not exceed 90%, while the CNN-GRU model had the smallest computational cost (32 G), which suits edge devices. Overall, the improved MS-DConv-Att-SHN model proposed in the study achieved good martial arts action recognition performance and balanced recognition accuracy and real-time performance well. The computational efficiency of the FA-AWM model was relatively limited, and the temporal modeling ability of the HMM Viterbi model still needed to be strengthened. The martial arts action recognition results of the above methods on the MPII and NTU RGB + D 60 datasets were analyzed, and the results are shown in Table 4.

Table 4.

Recognition results of martial arts movements using different methods across datasets.

Model	Data set	PCK@0.5 (%)	PCP (%)	mAP@0.5 (%)	Parameter quantity (M)	Inference latency (ms)	Robustness of occlusion (ORI)
MS-DConv-Att-SHN (Ours)	MPII	91.2	93.5	90.1	24.3	14.8	0.85
MS-DConv-Att-SHN (Ours)	NTU RGB + D 60	89.7	92.3	88.5	24.3	15.2	0.81
CNN-GRU	MPII	83.5	86.2	82.0	38.7	27.9	0.65
CNN-GRU	NTU RGB + D 60	82.1	85.6	80.9	38.7	28.4	0.62
FA-AWM	MPII	85.8	88.9	84.3	45.2	22.1	0.71
FA-AWM	NTU RGB + D 60	84.5	88.2	83.1	45.2	22.7	0.68
HMM Viterbi	MPII	77.6	80.1	76.2	12.1	51.3	0.58
HMM Viterbi	NTU RGB + D 60	76.3	79.4	74.8	12.1	52.9	0.55
MMSTNet	MPII	87.4	90.3	86.5	67.8	33.8	0.76
MMSTNet	NTU RGB + D 60	86.2	89.7	85.3	67.8	34.6	0.73
Rank GCN	MPII	88.7	91.2	87.6	31.5	19.3	0.80
Rank GCN	NTU RGB + D 60	87.4	90.1	86.2	31.5	19.8	0.77

In Table 4, the improved MS-DConv-Att-SHN model proposed by the study showed good accuracy on both datasets. Its PCK@0.5 on the MPII dataset reached 91.2%, 2.5 percentage points higher than the Rank GCN model (88.7%), and its mAP@0.5 on the NTU RGB + D 60 dataset was 88.5%, better than the Rank GCN model (86.2%). These gains are mainly due to the multi-scale feature fusion of multiple dilated convolutions, the enhancement of local poses by mixed attention, and more accurate temporal modeling of fast and continuous movements in martial arts. The HMM Viterbi model performs worst in martial arts action recognition, as its Markov assumption is difficult to adapt to the nonlinear temporal characteristics of martial arts actions. The parameter count of the improved MS-DConv-Att-SHN model (24.3 M) is only 36% of that of the MMSTNet model (67.8 M), and its inference latency on the MPII dataset (14.8 ms) is 23% lower than that of the Rank GCN model (19.3 ms), which ranks second in real-time performance, thanks to the sparse computation of depthwise separable convolution and local attention. Although the FA-AWM model has a relatively low latency (MPII: 22.1 ms), its parameter count (45.2 M) is too large for practical deployment. Meanwhile, the research model achieves limb subgraph decomposition through its local spatiotemporal feature module and therefore has good occlusion robustness: its ORI value (0.85) is higher than that of the Rank GCN model (0.80). The CNN-GRU model has a low ORI (NTU: 0.62) because it lacks an explicit occlusion processing mechanism. MMSTNet performs well on the multi-modal (RGB + depth) NTU dataset (mAP 85.3%) but lacks adaptability to single-modal MPII data, reflecting its strong modal dependence. The Rank GCN model lacks the ability to update its graph dynamically for the rapid changes in martial arts moves. The improved MS-DConv-Att-SHN model improves PCK stability by 12% in high-speed actions such as “high kicking” through inter-frame differencing in its motion branch and channel attention weighting. The recognition results of the above methods under different martial arts movements were analyzed, and the results are shown in Table 5.
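As a quick check of the ratios quoted for Table 4, the worked arithmetic below uses only values taken from the table; rounding to the nearest percent is the only assumption.

```latex
\begin{aligned}
\text{PCK@0.5 gain over Rank GCN (MPII)} &= 91.2 - 88.7 = 2.5 \ \text{percentage points} \\
\text{parameter ratio to MMSTNet} &= \frac{24.3}{67.8} \approx 0.358 \approx 36\% \\
\text{latency reduction vs. Rank GCN (MPII)} &= \frac{19.3 - 14.8}{19.3} \approx 0.233 \approx 23\%
\end{aligned}
```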

Table 5.

Martial arts action recognition results (PCK) of different algorithms.

Model	Clasping	Crossing palms	Splitting palms	Shining palms	Counterpunching	Grasping	Punching	Racking	Penetrating palms
CNN-GRU	92.75	89.32	80.81	80.02	83.14	75.75	70.73	88.49	80
FA-AWM	93.92	90.78	84.41	82.7	86.5	82.69	75.33	89.97	83.6
HMM Viterbi	93.88	91.62	84.07	74.58	88.05	81.89	74.66	90.81	83.26
MMSTNet	91.46	92.74	86.37	86.29	89.67	84.04	80.68	91.93	85.56
Rank GCN	92.87	93.98	88.86	91.91	91.36	88.69	84.85	93.17	88.05
Research model	95.95	94.1	89.22	97.73	92.17	88.35	86.96	93.29	88.41

In Table 5, the PCK values of the research model for the fist hugging, cross palm, split palm, bright palm, counter fist, grid fist, punching, frame punch, and piercing palm movements all exceeded 85%, and the maximum value reached 97.73%. The bright palm movement has a typical three-dimensional spatial feature of “palm abduction, forearm rotation, and slight elbow flexion,” and the spatial distribution of its key nodes is highly discriminative. In addition, the peak angular velocity of the bright palm movement (120°/s) is lower than that of punching (300°/s), its motion blur is lower, and the rigid motion of the palm root and wrist joint simplifies trajectory prediction. Therefore, the dilated convolution of the research model can better capture the radial features of palm expansion, and its channel attention mechanism can raise the feature weights of palm regions, resulting in better control of keypoint detection errors for geometric features. The MMSTNet and Rank GCN models, which performed well overall, had maximum PCK values of no more than 95% across actions, and their values for punching were only 80.68% and 84.85%, respectively, revealing shortcomings in adapting to fast linear attack actions. The research model achieved the highest PCK for punching (86.96%), suggesting that it is better suited to practical action analysis. The CNN-GRU and FA-AWM models performed well only on movements such as fist hugging and cross palm, and their overall recognition performance was far inferior to the other comparative methods, indicating that traditional temporal models lack adaptability to dynamic martial arts. This comparison highlights the advantage of the research model in complex martial arts scenarios (rapid change of moves, transition between offense and defense), providing a better solution for competitive martial arts training and tactical analysis. Afterward, the real-time performance of different models was compared on the self-made dataset, and the results are shown in Figure 11.

Figure 11.

Real-time performance results of different models on a self-made dataset.

In Figure 11, simple actions refer to static or low-speed actions (such as “boxing” and “clapping”), while complex actions refer to high-speed or full-body coordinated actions (such as “whirlwind feet” and “flying feet”). Delay amplification refers to the percentage increase in delay from normal scenes to occluded scenes. In Figure 11(a), the latency of the research model increased by 22.6% (12.4 → 15.2 ms) when the number of action categories increased from 10 to 30, compared with +20.8% for the MMSTNet model and +21.4% for the CNN-GRU model. Moreover, the delay amplification of the research model under occlusion (22.3%) is smaller than that of the other comparison models, and its FPS value (82) is 1.4 times that of the Rank GCN model. The latency of the CNN-GRU model increases significantly because of full-sequence recalculation, and the Rank GCN model may give asynchronous action feedback. Afterward, the confusion matrix was used to analyze the martial arts action recognition results before and after the improvement of the research method, and the results are shown in Figure 12.

Figure 12.

Confusion matrix of martial arts action recognition before and after the improvement of SHN.

In Figure 12, the recognition results for martial arts movements differed significantly before and after the improvement of the SHN. The recognition accuracies of the SHN for the fist hugging, cross palm, split palm, bright palm, counter fist, grid fist, punch, frame punch, and piercing palm movements were 81.26%, 82.33%, 78.54%, 84.25%, 71.25%, 88.37%, 89.15%, 78.33%, and 74.36%, respectively, with the maximum value not exceeding 90%. The improved MS-DConv-Att-SHN network proposed in the study achieved recognition accuracies of 90.91%, 92.13%, 94.23%, 96.28%, 98.12%, 98.34%, 98.21%, 97.95%, and 83.26% for the same movements, demonstrating good application performance. Afterward, the martial arts action recognition system designed in the study was tested by inviting 10 martial arts beginners to learn from the provided “Chinese Boxing” learning video. The test covered five basic movements of the twelve types of Chinese Boxing (punching, clapping, archery, horse step, and whirlwind foot), and the learning results were recorded separately with synchronized Kinect v3 and 4K camera recording. The recordings were imported into the martial arts action recognition system to analyze its application effect, and the output was compared with manual evaluation results (2 national-level martial arts judges scored independently, and their scores were averaged). The results are shown in Table 6.

Table 6.

Comparison results of martial arts action recognition system and manual evaluation.

Evaluation indicators	System recognition results	Professional coach evaluation results	Error/Variance rate	Significance test (p-value)
Action completion score (on a 10 point scale)	8.42 ± 0.63	8.57 ± 0.72	−1.75%	0.082
Keyframe matching rate	92.3%	94.1%	−1.8%	0.047
Typical number of error detections (per person)	3.2 ± 0.8	3.5 ± 1.1	−8.6%	0.112
Joint angle deviation (°)	4.7 ± 1.3	—	—	—
Response time (seconds/action)	0.8 ± 0.2	3.5 ± 1.4	+77.1%↑	<0.001
Continuous action coherence evaluation (level 5)	4.1 ± 0.5	4.3 ± 0.6	−4.7%	0.064

In Table 6, the difference rate between the system and manual evaluation is less than 5% for the action completion score, keyframe matching rate, and coherence evaluation, and there is no statistically significant difference in action completion (p = 0.082) or coherence (p = 0.064). The system response time (0.8 s) is shorter than the manual response time (3.5 s), making the system suitable for high-frequency action correction. The system can also automatically generate error reports covering posture deviation, rhythm errors, and other issues.
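The Error/Variance rate column of Table 6 appears to be the system value's deviation relative to the manual value. The sketch below reproduces the tabulated rates under that assumption; the function name is illustrative, and the keyframe matching row is left out because its −1.8% matches the absolute difference in percentage points (92.3 − 94.1) rather than a relative difference.

```python
def relative_difference(system, manual):
    """Signed difference of the system value relative to the manual value, in percent."""
    return (system - manual) / manual * 100.0


completion = relative_difference(8.42, 8.57)   # about -1.75
errors = relative_difference(3.2, 3.5)         # about -8.57
coherence = relative_difference(4.1, 4.3)      # about -4.65
response = relative_difference(0.8, 3.5)       # about -77.14, i.e. 77.1% less time
```

Under this reading, the response-time entry of +77.1% in the table reports the same quantity as a reduction in time per action.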

Discussion

To address the limitations of traditional human pose estimation and recognition for martial arts, the research improved the SHN and designed a martial arts action recognition system that covers data collection, pose estimation, and other components. In the ablation experiment, multiple dilated convolutions improved the detection accuracy and inference speed of the SHN, and the PCK gain exceeded 2.5 percentage points on both datasets. Under the same number of stacks, the detection accuracy with the expansion scale (2, 3) was higher than with (3, 5). The reason was that an excessively large expansion scale causes more of the convolution to fall on zero padding, resulting in missing feature learning and limited detection accuracy. The PCK values of the MS-DConv-Att-SHN model on the two datasets were 92.49% and 93.84%, respectively, with mAP values exceeding 85%, while the number of parameters and the computational complexity were reduced. Compared with the resolution-aware network proposed by Xu,¹⁰ the research method not only improved low-resolution adaptability but also enhanced multi-scale feature extraction through dilated convolution, which enabled joint detection to maintain high accuracy even under occlusion and motion blur. In addition, although Chen’s CNN-GRU framework¹⁸ could achieve a recognition rate of 90% with a mean square error below 3, the PCK of the research method exceeded 92%. This indicated that improving the SHN had more advantages in spatiotemporal feature modeling. In keypoint recognition for human posture, the improved MS-DConv-Att-SHN algorithm performed better than the other comparative models. The average recognition accuracy of the RF-PR model did not exceed 85%, and the average recognition accuracies of ST-GCN and RNF differed from that of the research model by 3.24% and 2.16%, respectively. The reason for this result was that MS-DConv-Att-SHN constructed a more efficient multi-scale feature extraction network through multiple dilated convolutions. Compared with the fixed receptive field of ST-GCN and the resolution adaptive mechanism of RNF, this design could simultaneously capture local details and global contextual information, which suited the fast movements and large posture changes common in martial arts. Moreover, the introduced attention module effectively reduced the performance degradation that traditional methods suffer under occlusion. Compared with the manual features of algorithms such as RF-PR and the 3D convolution features of PoseC3D, it could highlight the feature representation of key joints.
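The zero-padding explanation for the (2, 3) versus (3, 5) result can be illustrated with a small count of how many kernel taps land on real inputs. The sketch is one-dimensional and assumes a 3-tap kernel, “same” zero padding, and a feature map only 4 positions wide; these are hypothetical values chosen for the illustration, since the study does not report the resolution at which the dilated convolutions operate.

```python
def in_bounds_tap_fraction(size, dilation, kernel=3):
    """Fraction of kernel taps that read real inputs rather than zero padding.

    Counts, over every output position of a 1D feature map of length `size`,
    the taps at offsets j * dilation (j = -half..half) that fall inside the map.
    """
    half = kernel // 2
    total = 0
    in_bounds = 0
    for i in range(size):
        for j in range(-half, half + 1):
            total += 1
            if 0 <= i + j * dilation < size:
                in_bounds += 1
    return in_bounds / total


fractions = {d: in_bounds_tap_fraction(4, d) for d in (2, 3, 5)}
```

On this 4-wide map, two thirds of the taps read real inputs at dilation 2, half at dilation 3, and one third at dilation 5, where only the centre tap ever falls inside the map. This is consistent with the explanation that larger expansion scales lose feature information at low resolutions.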

MS-DConv-Att-SHN maintained high accuracy (PCK > 92%) while having lower computational complexity than temporal models such as ST-GCN and PoseC3D, effectively avoiding the complex graph structure calculations of ST-GCN. Compared with the 3D convolution of PoseC3D, the improved structure could maintain high precision without adding many parameters. The recognition accuracy of the research model on continuous martial arts movements exceeded 90%, which was superior to the other comparative models and achieved a good balance between recognition accuracy and real-time performance. The PCK values of the Rank GCN model and the HMM Viterbi model did not exceed 87%, and the recognition accuracy of the HMM Viterbi model did not exceed 90%. The research model effectively avoided the dependence of traditional methods (such as the HMM Viterbi model⁷) on complex temporal modeling while maintaining a continuous action recognition rate of over 90%. The research method was consistent with the goal of C. Papic’s neural network motion analysis,¹¹ but the study further balanced accuracy and speed through lightweight design, filling the gap in real-time performance of Chen’s random forest method.¹²

In the recognition of specific martial arts movements, the PCK values of the research model for the fist hugging, cross palm, split palm, bright palm, counter fist, grid fist, punching, frame punch, and piercing palm movements all exceeded 85%. However, the MMSTNet model and Rank GCN model only achieved values of 80.68% and 84.85% for punching, while the CNN-GRU model and FA-AWM model had poor overall recognition performance. The reason for these results was that martial arts movements emphasize coherence and contain many subtle movements, which makes it difficult for algorithms that ignore temporal features (such as CNN-GRU and HMM Viterbi) to perform well in recognition. Martial arts movements have unique motion patterns (such as specific force application methods and movement trajectories), to which the MS-DConv-Att-SHN model adapted better. The dilated convolution was suitable for capturing long-distance spatial dependencies in martial arts movements, and the mixed attention mechanism could enhance the feature expression of specific keypoints (such as wrists and elbows) in typical martial arts movements, breaking through the performance bottleneck of manual feature methods in RF-PR models. J. Echeverria’s pose sequence analysis¹⁵ and Pang’s feature alignment technique¹⁶ emphasized temporal correlation. The research results showed that a multi-convolution-attention mechanism could partially replace complex temporal models, providing new ideas for pose estimation.

Conclusion

The improved MS-DConv-Att-SHN model proposed in the study achieved good martial arts action recognition performance and application performance. Its improvements to the SHN effectively balanced accuracy and computational efficiency, providing a feasible solution for real-time interactive systems. The improved model can calculate how closely movements match the standard and can be deployed in provincial martial arts championships for automatic scoring and slow-motion replay analysis, thereby improving the efficiency of manual scoring. The local feature refinement module in the research model can also detect the posture and joint angles of learners, which supports intelligent teaching correction. However, the research method still has room for improvement, for example in the diversity of martial arts movements and the tracking of joint angles. Specifically, the self-made dataset does not cover some unconventional movements of characteristic schools (such as the “lying down fist” rolling movement) and lacks modeling of joint-instrument interaction features for instrument-based martial arts, resulting in a small sample size in the training dataset. Moreover, deriving joint data from posture analysis results can easily overlook the temporal joint drift caused by motion blur during fast rotating martial arts movements. In the future, interactive virtual technology, model simplification technology, and other techniques will be introduced to improve the real-time performance and applicability of the research model. Specifically, dedicated model compression techniques (channel pruning + hybrid quantization) will be developed, or stepwise spatiotemporal convolution modules will be designed, to make the model more lightweight, improve real-time performance, and reduce the accumulation of errors in long action sequences. A biomechanical analysis module will be developed, millimeter wave radar and electromyographic signals will be integrated, and a multi-person interactive recognition system will be constructed to enhance the adaptability of the model to dynamic environments. At the same time, future research can also consider more types of martial arts movements and video shooting angles, and attempt to construct a knowledge graph of martial arts movements to improve joint prediction performance, action recognition accuracy, and model training efficiency.

ORCID iD

Zhigang Chen https://orcid.org/0009-0004-0458-5912

Footnotes

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research is supported by Xinjiang Hetian College, “Value Reconstruction of Ethnic Traditional Sports Events in Promoting Rural Revitalization” in Southern Xinjiang (Project No. 2025SK006).

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset. Multimed Tools Appl 2023; 82(14): 20771–20818.

2. Kumar. A survey on intelligent human action recognition techniques. Multimed Tools Appl 2024; 83(17): 52653–52709.

3. Topham, Khan, Al-Jumeily, et al. Human body pose estimation for gait identification: a comprehensive survey of datasets and models. ACM Comput Surv 2022; 55(6): 1–42.

4. Vikalwe Shakrani, Mathew Kanyangarara, Parowa, et al. A deep learning model for face recognition in presence of mask. Acta inform Malays 2022; 6(2): 43–46.

5. Zhao, Guo. Adaptive enhancement design of non-significant regions of a Wushu action 3D image based on the symmetric difference algorithm. Math Biosci Eng 2023; 20(8): 14793–14810.

6. A method for recognising wrong actions of martial arts athletes based on keyframe extraction. Int J Biom 2024; 16(3–4): 256–271.

7. Husheng. Martial arts moves recognition method based on visual image. J Inf Process Syst 2022; 18(6): 813–821.

8. Zhao, Lin, Sun, et al. A review of state-of-the-art methodologies and applications in action recognition. Electronics-Switz 2024; 13(23): 1–39.

9. Zou. Improving human pose estimation based on stacked hourglass network. Neural Process Lett 2023; 55(7): 9521–9544.

10. Chen, Moreno-Noguer, et al. 3D human pose, shape and texture from low-resolution images and videos. IEEE T Pattern Anal 2021; 44(9): 4490–4504.

11. Papic, Sanders, Naemi, et al. Improving data acquisition speed and accuracy in sport using neural networks. J Sport Sci 2021; 39(5): 513–522.

12. Chen. A wushu leg gesture recognition algorithm based on random forest and bone feature extraction (RF-SFE). Int J Hi Spe Ele Syst 2025; 34(1): 2540066.

13. Cherepov, Eganov, Bakushin, et al. Maintaining postural balance in martial arts athletes depending on coordination abilities. J Phys Educ Sport 2021; 21(6): 3427–3432.

14. Yamei, Qiang. Retracted article: dynamic light collection system based on human posture estimation application in martial arts action teaching simulation. Opt Quant Electron 2024; 56(3): 376.

15. Echeverria, Santos. Toward modeling psychomotor performance in karate combats using computer vision pose estimation. Sensors 2021; 21(24): 8378.

16. Pang, Zhang. Explainable quality assessment of effective aligned skeletal representations for martial arts movements by multi-machine learning decisions. Sci Rep 2025; 15(1): 323.

17. Jia, Han. Retracted article: visual system based on optical sensor in wushu training image trajectory simulation. Opt Quant Electron 2024; 56(4): 501.

18.

Chen

. An interpretable composite CNN and GRU for fine-grained martial arts motion modeling using big data analytics and machine learning. Soft Comput 2024; 28(3): 2223–2243.

19.

Hui

. Visualization system of martial arts training action based on artificial intelligence algorithm. Soft Comput 2023; 1(12): 1–12.

20.

Lei

. Feature extraction-based fitness characteristics and kinesiology of wushu Sanda athletes in university analysis. Math Probl Eng 2022; 1: 5286730.

21.

Liu

Yang

, et al. Recognition of TaeKwonDo kicking techniques based on accelerometer sensors. Heliyon 2024; 10(12): e32475.

22.

Cheng

Wang

. Construction of sports training management information system using AI action recognition. Sci Program 2022; 1: 8393612.

23.

Shang

. Advancing martial arts training: neural network-based recognition and assistance systems in biotechnological applications. J Commer Biotechnol 2024; 29(5): 84–94. DOI: 10.5912/jcb1913.

24.

Hrovatič

Peer

Štruc

, et al. Efficient ear alignment using a two‐stack hourglass network. IET Biom 2023; 12(2): 77–90.

25.

Wang

Zhang

, et al. Uniformer: unifying convolution and self-attention for visual recognition. IEEE T Pattern Anal 2023; 45(10): 12581–12600.

26.

Chen

Kong

, et al. A deep hourglass-structured fusion model for efficient single image dehazing. Multimed Tools Appl 2022; 81(24): 35247–35260.

27.

Luo

. Stacked hourglass networks based on polarized self-attention for human pose estimation. Proc Second IYSF Academic Symposium on Artificial Intelligence and Computer Engineering 2021; 12079: 543–548.

28.

Verma

Srivastava

. Two-stage multi-view deep network for 3D human pose reconstruction using images and its 2D joint heatmaps through enhanced stack-hourglass approach. Vis Comput 2022; 38(7): 2417–2430.

29.

Zhang

Bai

, et al. Animal pose estimation algorithm based on the lightweight stacked hourglass network. IEEE Access 2022; 11(1): 5314–5327.

30.

Gheisari

Hamidpour

Liu

, et al. Data mining techniques for web mining: a survey. Artif Intell Appl 2022; 1(1): 3–10.

31.

Huang

. Stacked attention hourglass network based robust facial landmark detection. Neural Netw 2023; 157: 323–335.

32.

Zhao

Wang

Gong

, et al. Estimating human pose efficiently by parallel pyramid networks. IEEE T Image Process 2021; 30(1): 6785–6800.

33.

Guo

Liu

, et al. Action status based novel relative feature representations for interaction recognition. Acta Electron Sin 2022; 31: 168–180.

34.

Zhang

, et al. A spatial attentive and temporal dilated(SATD)GCN for skeleton-based action recognition. J Intell Technol 2022; 7: 46–55. DOI: 10.1049/cit2.12012.

35.

Piatysotska

Podrіgalo

Romanenko

, et al. Comparative analysis of motor functional asymmetry indicators in athletes of cyclic sports, martial arts, and esports. Phys Educ Stud 2023; 27(4): 212–220.

36.

Manolachi

Chernozub

Tsos

, et al. Modeling the correction system of special kick training in mixed martial arts during selection fights. J Phys Educ Sport 2023; 23(8): 2203–22110.