Body posture recognition of aerobics based on improved OpenPose lightweight estimation algorithm

Abstract

As aerobics becomes an increasingly popular form of exercise, the demand for precise movement standards has grown. Traditional human pose recognition models no longer meet the practical requirements of aerobics scenarios. To address this, the study builds on the OpenPose estimation technique, optimizes it with attention mechanisms and graph neural networks, and proposes a hybrid human pose recognition model. Performance validation and ablation experiments show that the model has an accuracy of 95% and a loss value as low as 0.0089. The highest score for human key points is 7.5, with angle error and position error reduced to 4.9% and 5.4%, respectively, outperforming the base algorithm. This highlights the success of the proposed optimization and enhancement techniques. In practical application comparison experiments, the recognition model achieves a running time of 8 ms when recognizing 150 images, significantly outperforming the comparison models. In multi-person recognition experiments, the proposed model reaches an accuracy of 93%. Additionally, the model shows superior performance in visualizing human pose recognition in practical scenarios. These results indicate that the model has high recognition accuracy and robustness, and can adapt to various real-world applications, meeting the high demands for human pose recognition in aerobics.

Keywords

aerobics, human pose recognition, OpenPose, attention mechanism, AGCN

Introduction

Aerobics is a lightweight sport that integrates entertainment, fitness, and leisure. It has gained widespread popularity due to its lack of restrictions in terms of venue, equipment, and fitness level. Aerobics is crucial for enhancing the public’s physical fitness. It not only strengthens muscle strength, improves rhythm, and fosters teamwork awareness, but also conveys rich cultural information through gymnastic dance, enhancing cultural confidence.^1,2 Due to its numerous benefits, experts and scholars domestically have conducted various levels of research on it. Masagca et al. studied the impact of aerobics on muscle endurance by conducting a two-and-a-half-month aerobics training experiment on over 100 university students without prior athletic experience. The results demonstrated significant improvements in their scores in push-ups, planks, and wall sit tests after the training.³ To explore whether aerobics has a positive effect on the performance of football players, Panihar et al. randomly divided a group of football players into two groups for a comparative experiment. Following the experiment, physical fitness tests were administered to both groups. The results revealed that the exercise group outperformed the control group in areas such as speed, agility, flexibility, and other fitness metrics.⁴ These studies indicate the positive impact of aerobics on improving public physical fitness and health.

The most widely used approach for human pose recognition relies on the dynamic movements of body joints. With the development of deep learning technology, methods using human skeletal key points for pose recognition have made significant progress.^5,6 For instance, Xu et al. addressed the issue of current human pose recognition models failing to capture all the poses needed during training and testing phases. They proposed a method to extract human skeletal data from multiple perspectives using deep learning networks, and verified its high accuracy in simulation experiments.⁷ Juang et al. innovatively proposed a classification system for four human postures: standing, bending, sitting, and lying. This system combines the advantages of Gaussian mixture models, contour intersection methods, and support vector machines, improving the interpretability and robustness of human posture classification.⁸ These studies optimize and improve human pose recognition using deep learning techniques, but still face limitations such as large model computation and high hardware requirements.

OpenPose technology marks a significant breakthrough in multi-person 2D pose estimation. It is capable of detecting key points across different body parts, including the hands, legs, and face, all at once. Additionally, it is compatible with multiple operating systems. OpenPose has found practical applications in areas such as human-computer interaction, security surveillance, and sports analysis.⁹ The open-source nature of OpenPose has attracted many experts and scholars both domestically and internationally to research and improve it. For example, Osawa and other scholars designed a remote rehabilitation system for elderly individuals who cannot undergo professional rehabilitation training in hospitals. This system uses a monocular camera to capture user motion images and applies OpenPose technology to recognize poses, providing professional remote guidance. Experimental results show that the system’s accuracy and feasibility meet practical application requirements.¹⁰ Hsiao et al. proposed a low-cost, markerless system for evaluating push-up movements based on OpenPose, Python code, and fuzzy inference techniques. The system demonstrated good reliability in experiments and is expected to be applied in rehabilitation exercises and industrial safety.¹¹ Jeongzh et al. addressed smoking behavior in non-smoking areas in public places by combining OpenPose technology with smoke detection hardware, proposing a smoking behavior recognition system. The system demonstrated a recognition rate of over 70%, providing a new approach for identifying smoking behavior in public spaces.¹² The development of automobile automation has raised higher demands for intelligent vehicle interiors. To address this, Walocha et al. used OpenPose and electrocardiogram data to capture driver states in real-time, dynamically adjusting driving modes or interior lighting to help drivers operate vehicles more safely and intelligently.¹³ Overall, OpenPose has been successfully applied in various fields and has achieved notable results. However, there is still a lack of research on applying OpenPose technology to aerobics.

In summary, although OpenPose technology is widely used in fields such as human-computer interaction and rehabilitation training, it has not been specifically adapted to the rhythmic and continuous movements unique to aerobics. As a result, it cannot accurately recognize the dynamic characteristics and standard postures of aerobics movements. Aerobics teaching and training scenarios often require multi-perspective and multi-person recognition simultaneously. However, the original OpenPose model has a large computational load and requires high-performance hardware, making it difficult to perform real-time motion analysis on ordinary terminals. This limits its widespread adoption in popular fitness settings. The research is based on OpenPose technology, integrating depthwise separable convolution, attention mechanism and adaptive graph convolutional network to construct an aerobics human pose recognition model. This technology integration fills the application gap of OpenPose in the field of aerobics, balancing computational efficiency and recognition accuracy, providing an innovative technical path for aerobics teaching and training, and has great practical value.

Methods and materials

Construction of multi-person pose estimation model based on OpenPose

OpenPose is a bottom-up human pose estimation method, which mainly includes the backbone feature extraction network, multi-stage iterative refinement module, and fusion parallel output section.^14,15 The core principle involves first identifying the key points of the human body in the target image, and then inferring the specific posture based on the relationships between these key points.^16,17 Due to its dual-stream joint prediction mechanism, this algorithm has proven effective in applications such as human-computer interaction, video surveillance, and more.^18,19 The specific implementation process is shown in Figure 1.

Figure 1.

OpenPose model for human pose recognition.

As shown in Figure 1, the OpenPose algorithm initially extracts multi-level features from the raw input image for human pose recognition. It outputs confidence maps and part affinity fields based on these features, and then uses these feature maps as input for the next stage to detect human key points and the relations between key points, thus recognizing the human pose. The output representation for the confidence maps and part affinity fields in the initial stage of the feature map is given by equation (1)

$$\begin{cases} S^{1} = \rho^{1}(F) \\ L^{1} = \phi^{1}(F) \end{cases}$$

(1)

In equation (1), $F$ represents the feature map, $S^{1}$ and $L^{1}$ represent the confidence map and part affinity field output in the first stage, while $\rho^{1}$ and $\phi^{1}$ represent the upper and lower branches of the first-stage network. The expressions for the confidence map and part affinity field in stage $t$ are shown in equation (2)

$$\begin{cases} S^{t} = \rho^{t}(F, S^{t-1}, L^{t-1}), & \forall t \geq 2 \\ L^{t} = \phi^{t}(F, S^{t-1}, L^{t-1}), & \forall t \geq 2 \end{cases}$$

(2)

In equation (2), $S^{t-1}$ and $L^{t-1}$ represent the confidence map and part affinity field output from stage $t-1$, while $\rho^{t}$ and $\phi^{t}$ represent the upper and lower Convolutional Neural Networks (CNN) in stage $t$. Through continuous iteration and refinement of the confidence maps and part affinity fields, the algorithm achieves high-precision human pose estimation results. The OpenPose algorithm has high robustness in human pose recognition, but the VGG-19 backbone network used in the original OpenPose algorithm contains 19 weight layers (16 convolutional and 3 fully connected), leading to a large computational load, which impacts the efficiency and precision of feature map extraction.^20,21 Depthwise separable convolutions, however, can reduce computational load significantly while maintaining feature extraction quality. To reduce the computational burden of the OpenPose algorithm and enhance both the efficiency and quality of feature extraction, the study substitutes the convolutional layers in the pre-feature extraction stage with depthwise separable convolution layers, as shown in Figure 2.
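To make the saving concrete, the following minimal sketch compares the weight counts of a standard convolution and a depthwise separable one. The function names and the layer sizes (3 × 3 kernel, 128 input channels, 256 output channels) are hypothetical and chosen only for illustration; biases are ignored, and the classic depthwise-then-pointwise ordering is assumed rather than the exact layer layout of the study.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std
```

For these assumed sizes the separable layer needs roughly 11.5% of the weights of the standard layer, which agrees with the general ratio 1/N + 1/k² for N output channels and kernel size k.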

Figure 2.

Depthwise separable convolution layer workflow.

As depicted in Figure 2, the depthwise separable convolution layer is composed of two components: depthwise and pointwise convolution. To ensure that the depthwise convolution effectively extracts features in a higher-dimensional space, the study uses pointwise convolution before the depthwise convolution for dimensionality expansion. After expanding the dimensions, the channels can be extended to any appropriate size, allowing the depthwise convolution to capture more effective information. The convolution parameter scale is shown in equation (3)

$$\begin{cases} S_{layer} = S_{input} \times S_{dwc} \times S_{pwc} \times N_{filter} \\ S_{pwc} \propto S_{input} \end{cases}$$

(3)

In equation (3), $S_{layer}$ represents the current layer’s parameter scale, $S_{input}$ represents the input feature map’s size, $S_{dwc}$ and $S_{pwc}$ represent the kernel sizes of the depthwise and pointwise convolutions, and $N_{filter}$ represents the number of filters (output feature maps). The pointwise convolution helps the depthwise convolution extract image features more effectively. Additionally, the original OpenPose algorithm uses the ReLU function as the activation function, which can cause the loss of depthwise convolution kernels. Therefore, the study replaces the ReLU function with the h-swish function, as shown in equation (4), where $\mathrm{ReLU6}(z) = \min(\max(z, 0), 6)$

$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$

(4)
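As a quick illustration of equation (4), the sketch below evaluates h-swish on scalars, assuming the standard definition with the ReLU6 clamp. The helper names `relu6` and `h_swish` are chosen here for illustration and are not taken from the study's code.

```python
def relu6(z):
    """Clamp z to the interval [0, 6]."""
    return min(max(z, 0.0), 6.0)


def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


# The function is zero for x <= -3, equals x for x >= 3, and is smooth-ish in between.
samples = [h_swish(x) for x in (-4.0, -3.0, 0.0, 1.0, 3.0, 10.0)]
```

Because the function uses only a clamp, an addition, and a multiplication, it avoids the exponential needed by the original swish, which is why it suits lightweight networks.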

After performing feature extraction within the same layer, the challenge of minimizing interference features while emphasizing the capture of important features emerges. The study addresses this issue by incorporating an attention mechanism, which learns the importance weights of each feature map channel to further suppress the interference from insignificant channels or pixels.^22,23 However, the attention mechanism requires weight calculation in the fully connected layer, which increases the model’s parameter count and computational dimensions, thus impacting the model’s running speed. To minimize the time cost of training and inference, the study introduces an ultra-lightweight subspace attention mechanism (ULSAM), which attempts to improve the neural network’s efficiency using subspace attention mechanisms, thus promoting the application and development of compact CNNs. The structure of ULSAM is shown in Figure 3.

Figure 3.

Schematic diagram of the structure of ULSAM.

As shown in Figure 3, ULSAM mainly consists of three parts: the input, channel attention mechanism, and output layer. The input layer receives the human pose feature maps extracted by convolutional layers, and the channel attention mechanism layer treats feature maps extracted from convolution kernels at different scales as subspaces. It then learns the importance weights for each subspace feature and performs weighted adjustment on the feature maps at the output layer to enhance the feature representation. The final output is a set of feature maps with different importance values. The ULSAM processing ensures focus on important information and prevents interference from irrelevant features. The expression for attention assignment of one group of subspace features is shown in equation (5)

$$C_{i} = \mathrm{softmax}(PW(\mathrm{maxpool}(DW(F_{i}))))$$

(5)

In equation (5), $F_{i}$ represents a group of feature maps, $DW$ represents the depthwise convolution operation, $\mathrm{maxpool}$ represents the max pooling operation, $PW$ represents the pointwise convolution operation, $\mathrm{softmax}$ represents the normalization operation, and $C_{i}$ represents the final set of attention maps. After obtaining the attention maps, the weight distribution of subspace feature attention is calculated, as shown in equation (6)

$$\hat{F}_{i} = (C_{i} \otimes F_{i}) \oplus F_{i}$$

(6)

As shown in equation (6), the attention map $C_{i}$ obtained from equation (5) is multiplied element-wise by $F_{i}$ to obtain the attention-weighted subspace features. These are then added to the original $F_{i}$, producing the optimized feature map $\hat{F}_{i}$. Finally, all the optimized feature maps are concatenated, as shown in equation (7)

$$\hat{F} = \mathrm{concat}([\hat{F}_{1}, \hat{F}_{2}, \cdots, \hat{F}_{i}, \cdots, \hat{F}_{n}])$$

(7)

After concatenating all the feature maps using equation (7), the final attention feature map group $\hat{F}$ is obtained. Through the ULSAM processing, the expressive ability and diversity of the feature maps are improved, thus dynamically adjusting the feature maps. Considering that aerobics often involves multiple people, using only ULSAM for complex data processing cannot fully exploit the relationships between data, leading to insufficient expression in multi-person pose estimation. Therefore, the study adds deep connections between adjacent ULSAMs to better model the features, and finally combines depthwise separable convolution layers, ULSAM, and deep connections into the backbone network under the OpenPose framework to construct a pose estimation model for aerobics, named DC-ULSAM. The overall structure of the DC-ULSAM model is shown in Figure 4.
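The subspace attention of equations (5) to (7) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: each feature map is a flattened list of spatial values, and the learned depthwise convolution, max pooling, and pointwise convolution are replaced by a hypothetical stand-in, `mean_score`, that averages the channels of a group. All function names are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mean_score(group):
    """Hypothetical stand-in for PW(maxpool(DW(F_i))): channel mean per position."""
    n_ch = len(group)
    return [sum(ch[p] for ch in group) / n_ch for p in range(len(group[0]))]


def subspace_attention(group, score_fn=mean_score):
    """Equations (5) and (6) for one group F_i of flattened feature maps."""
    attention = softmax(score_fn(group))  # C_i: one weight per spatial position
    # (C_i * F_i) + F_i, applied to every channel of the group
    return [[a * x + x for a, x in zip(attention, ch)] for ch in group]


def ulsam(feature_maps, n_groups, score_fn=mean_score):
    """Split channels into groups, attend within each, and concatenate (equation (7))."""
    if len(feature_maps) % n_groups != 0:
        raise ValueError("channel count must be divisible by n_groups")
    size = len(feature_maps) // n_groups
    out = []
    for g in range(n_groups):
        group = feature_maps[g * size:(g + 1) * size]
        out.extend(subspace_attention(group, score_fn))
    return out


# Four channels with two spatial positions each, split into two subspaces.
maps = [[1.0, 1.0], [3.0, 3.0], [0.0, 2.0], [0.0, 2.0]]
refined = ulsam(maps, n_groups=2)
```

The residual addition in `subspace_attention` keeps the original features intact, so a poorly trained attention map can only re-weight information, not erase it.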

Figure 4.

Schematic diagram of DC-ULSAM model structure.

As illustrated in Figure 4, the DC-ULSAM model for aerobics pose estimation begins by extracting features from the original image, generating a set of feature maps. These maps are then passed through convolutional neural networks for an initial estimation. In each subsequent stage, the estimation results from the previous stage are combined with the feature maps, which leads to a more refined estimation. After multiple iterations, confidence maps for key points and part affinity fields are produced, and human pose skeleton diagrams are generated from them.
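The stage-wise refinement described here, and formalized in equations (1) and (2), can be sketched as a simple loop. The function `refine` and the toy stage functions below are hypothetical placeholders for the trained branch networks; they only show how each stage consumes the backbone features together with the previous stage's outputs.

```python
def refine(features, first_stage, later_stages):
    """Run equation (1) once, then equation (2) for every later stage.

    first_stage:  pair (rho_1, phi_1), each called with F
    later_stages: list of pairs (rho_t, phi_t), each called with (F, S_prev, L_prev)
    """
    rho_1, phi_1 = first_stage
    S, L = rho_1(features), phi_1(features)
    for rho_t, phi_t in later_stages:
        # Both branches read the outputs of stage t-1 before either is overwritten.
        S, L = rho_t(features, S, L), phi_t(features, S, L)
    return S, L


# Toy scalar "networks" used only to trace the data flow.
first = (lambda F: F + 1.0, lambda F: F * 2.0)
stage = (lambda F, S, L: S + L, lambda F, S, L: S - L + F)
S2, L2 = refine(1.0, first, [stage])
S3, L3 = refine(1.0, first, [stage, stage])
```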

Construction of multi-person pose recognition model for aerobics

To address the accuracy bottleneck and insufficient generalization of action classification in traditional aerobics human pose recognition models in multi-person scenarios, a multi-person pose estimation model based on OpenPose is proposed in this study. On this basis, in order to improve the classification performance of the model for diverse aerobics movements, the Adaptive Graph Convolutional Networks (AGCN) were introduced in the study to adapt to different types of data.^24,25 Therefore, based on AGCN, the study extracts keypoint coordinates, bone lengths, directions, and motion information from the human skeleton map. The Space-Time-Channel (STC) attention module is incorporated to improve the extraction of key features. By combining AGCN with the STC attention module, a human pose recognition model called AGCN-STC is developed.²⁶ The structure of this recognition model is depicted in Figure 5.

Figure 5.

Structure of the AGCN-STC recognition model.

In Figure 5, the AGCN-STC action recognition model first preprocesses the pose information upon receiving the human pose skeleton map, removing redundant pose information. The preprocessing method deduplicates the pose information through Non-Maximum Suppression (NMS). The final score of each pose is first obtained, as shown in equation (8)

$$N_{map} = \begin{cases} 0, & U_{IoU}(M, N_{in}) \geq N_{t} \\ N_{score}, & U_{IoU}(M, N_{in}) < N_{t} \end{cases}$$

(8)

In equation (8), $N_{in}$ and $M$ represent the target pose and the highest-scoring pose, $U_{IoU}$ represents their intersection over union, $N_{score}$ and $N_{map}$ represent the original score and final score of the pose, and $N_{t}$ is the preset overlap threshold. After obtaining the final score for each pose, the study uses the Max function to deduplicate the poses, as shown in equation (9)

$$N_{ms} = \mathrm{Max}(N_{map}, P_{foot})$$

(9)
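A minimal sketch of the score suppression in equation (8) is given below. It assumes that each pose is summarized by an axis-aligned bounding box `(x1, y1, x2, y2)` and applies the standard greedy NMS loop; the names `iou` and `suppress` are illustrative, and the local window of equation (9) is not modeled.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress(boxes, scores, threshold):
    """Greedy NMS; returns indices of the kept poses, highest score first.

    Scores are assumed to be positive; a score of 0 marks a removed pose.
    """
    scores = list(scores)
    keep = []
    while True:
        best = max(range(len(scores)), key=lambda i: scores[i], default=None)
        if best is None or scores[best] <= 0.0:
            break
        keep.append(best)
        top_box = boxes[best]  # M, the current highest-scoring pose
        scores[best] = 0.0     # take M out of the candidate pool
        for i, s in enumerate(scores):
            if s > 0.0 and iou(top_box, boxes[i]) >= threshold:
                scores[i] = 0.0  # equation (8): overlapping pose gets score 0
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
kept = suppress(boxes, [0.9, 0.8, 0.7], threshold=0.5)
```

In this toy input the second box overlaps the first with an IoU of about 0.68 and is removed, while the distant third box is kept.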

In equation (9), $P_{foot}$ represents the local optimization window, and $N_{ms}$ represents the final output value. The preprocessed pose information is then fed into three graph convolutional networks for feature extraction: the joint feature input flow, the bone feature input flow, and the motion time information flow. The joint feature input flow extracts joint coordinates, the bone feature input flow extracts bone lengths and directions, and the motion time information flow extracts motion information based on global attributes.²⁷ After each graph convolutional network layer, the Softmax function is applied to score the output information flow, and the weighted sum of the scores is used to obtain the final human pose recognition result. In the bone feature extraction step, the bone vector between joints is defined to represent the bone length and direction between two joints. The process is as follows: first, the centroid of the skeleton is determined, and the joint closer to the centroid is defined as the source joint, while the other joint is defined as the target joint. The expression for the bone vector is shown in equation (10)

$$e_{i,j,t} = (x_{j,t} - x_{i,t},\; y_{j,t} - y_{i,t},\; z_{j,t} - z_{i,t})$$

(10)
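The bone vector of equation (10) is the coordinate difference from the source joint to the target joint. The sketch below computes it for one frame; the joint coordinates, the `bones` index pairs, and the function names are hypothetical and serve only as an illustration.

```python
import math


def bone_vector(source, target):
    """Vector from the source joint to the target joint, as in equation (10)."""
    return tuple(t - s for s, t in zip(source, target))


def bone_features(joints, bones):
    """joints: list of (x, y, z) for one frame; bones: list of (source_idx, target_idx)."""
    return [bone_vector(joints[i], joints[j]) for i, j in bones]


def bone_length(vec):
    """Euclidean length of a bone vector."""
    return math.sqrt(sum(c * c for c in vec))


# Three joints in one frame; index 0 is assumed to be closest to the centroid.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (3.0, 5.0, 0.0)]
bones = [(0, 1), (1, 2)]
vectors = bone_features(joints, bones)
lengths = [bone_length(v) for v in vectors]
```

The vector carries the bone direction and its norm gives the bone length, so one tuple per bone covers both quantities mentioned in the text.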

In equation (10), $(x_{i,t}, y_{i,t}, z_{i,t})$ and $(x_{j,t}, y_{j,t}, z_{j,t})$ represent the coordinates of the source and target joints, respectively. By calculating the bone vectors between all joints, the bone feature map for the human body is obtained. In the action classification recognition step, to achieve the optimal recognition of different action class samples, the model decomposes the adjacency matrix of the graph into a weighted sum of two subgraphs, as shown in equation (11)

$$f_{out} = \sum_{k}^{k_{v}} W_{k} f_{in} (B_{k} + \alpha C_{k})$$

(11)
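Equation (11) can be traced with small matrices. The sketch below assumes the shapes `f_in`: C_in × V, `W_k`: C_out × C_in, and `B_k`, `C_k`: V × V, where V is the number of joints; these shapes, the helper `matmul`, and the function name `adaptive_graph_conv` are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def adaptive_graph_conv(f_in, weights, B, C, alpha):
    """Sum over k of W_k * f_in * (B_k + alpha * C_k), as in equation (11)."""
    c_out, v = len(weights[0]), len(f_in[0])
    f_out = [[0.0] * v for _ in range(c_out)]
    for W_k, B_k, C_k in zip(weights, B, C):
        # Adaptive adjacency for this partition: learned graph plus scaled similarity graph.
        adj = [[b + alpha * c for b, c in zip(row_b, row_c)]
               for row_b, row_c in zip(B_k, C_k)]
        term = matmul(W_k, matmul(f_in, adj))
        f_out = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(f_out, term)]
    return f_out


# One input channel, two joints, one output channel, two partitions.
f_in = [[1.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out_one = adaptive_graph_conv(f_in, [[[2.0]]], [identity], [swap], alpha=0.5)
out_two = adaptive_graph_conv(f_in, [[[2.0]], [[1.0]]], [identity, identity], [swap, zeros], alpha=0.5)
```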

In equation (11), $B_{k}$ represents the connection and connection strength between two joints, $C_{k}$ represents the similarity between two joints, $\alpha$ represents the uniquely parameterized coefficient for each layer, $f_{in}$ and $k_{v}$ represent the input feature map and regional partitioning of the network, $W_{k}$ is the parameter for the $k$-th region, and $f_{out}$ is the final feature map. By decomposing the subgraphs and applying weighted sums, the importance of each subgraph in each layer can be adaptively adjusted, making the model more flexible and versatile. Let the feature of node $v_{i}$ in the graph convolutional network be $h_{i}$. The attention mechanism dynamically adjusts the feature aggregation process by calculating the correlation weight $e_{ij}$ between nodes. The feature fusion formula is shown in equation (12)

$$e_{ij} = \mathrm{LeakyReLU}(a^{T}[W h_{i} \,\|\, W h_{j}])$$

(12)

In equation (12), $h_{j}$ represents the original feature vector of node $v_{j}$, $W$ is a learnable weight matrix, $\|$ denotes vector concatenation, and $a$ represents the learnable attention parameter vector. To focus the model on important nodes, bones, and features, the study introduces the STC module in each layer of the graph convolutional network. The structure of the STC module is shown in Figure 6.
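The correlation weight of equation (12) can be sketched as follows. The negative slope of 0.2 for LeakyReLU is an assumption (the text does not state it), and the function names are illustrative.

```python
def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU; the slope value is an assumed default."""
    return x if x >= 0 else negative_slope * x


def matvec(W, h):
    """Matrix-vector product with nested lists."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_logit(h_i, h_j, W, a, negative_slope=0.2):
    """e_ij = LeakyReLU(a^T [W h_i || W h_j]), as in equation (12)."""
    concat = matvec(W, h_i) + matvec(W, h_j)  # list concatenation plays the role of ||
    return leaky_relu(sum(p * q for p, q in zip(a, concat)), negative_slope)


W = [[1.0, 0.0], [0.0, 1.0]]
e_pos = attention_logit([1.0, 2.0], [3.0, -1.0], W, [1.0, 1.0, 1.0, 1.0])
e_neg = attention_logit([1.0, 2.0], [3.0, -1.0], W, [-1.0, -1.0, -1.0, -1.0])
```

In graph attention networks these logits are usually normalized with a softmax over the neighbors of node i before aggregation; whether the study does the same is not stated in the text.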

Figure 6.

STC attention module structure diagram.

As shown in Figure 6, the STC module is composed of three components: the spatial attention module, the temporal attention module, and the channel attention module. The spatial attention module enables the neural network to assign different levels of attention to different joints,²⁸ with the attention calculation expression shown in equation (13)

$$M_{s} = \delta(g_{s}(\mathrm{AvgPool}(f_{in})))$$

(13)

In equation (13), $\mathrm{AvgPool}$ represents the average of the feature maps across all frames, $g_{s}$ represents the one-dimensional convolution operation in space, $\delta$ is the sigmoid activation function, and $M_{s}$ represents the calculated spatial attention map. The computation for the temporal attention module follows a similar approach to the spatial attention module, as shown in equation (14)

$$M_{t} = \delta(g_{t}(\mathrm{AvgPool}(f_{in})))$$

(14)

In equation (14), $M_{t}$ represents the calculated temporal attention map. The channel attention module enhances the feature discrimination of channels based on the input sample, with the attention map expressed as in equation (15)

$$M_{c} = \theta(W_{2}(\theta(W_{1}(\mathrm{AvgPool}(f_{in})))))$$

(15)

In equation (15), $W_{1}$ and $W_{2}$ represent the weights of two fully connected layers, $\theta$ represents the ReLU activation function, and the expression for the ReLU activation function is shown in equation (16)

$$f(x) = \max(0, x)$$

(16)

The role of the ReLU activation function is to limit the output values to a non-negative range. By introducing the STC module, the model can strengthen feature extraction along three dimensions: time, space, and channel, avoiding the omission of important information and improving the accuracy of human pose recognition. Finally, the study combines the DC-ULSAM human pose estimation module and the AGCN-STC recognition module to build a human pose recognition model based on OpenPose technology, named AA-OP. The specific structure of this recognition model is shown in Figure 7.
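The three attention maps of equations (13) to (15) can be sketched on a small tensor stored as nested lists `f_in[channel][frame][joint]`. Several simplifications are assumed here: the pooling averages over every axis except the one being attended, the channel mixing inside $g_{s}$ and $g_{t}$ is folded into that average, and the kernels and weight matrices are hypothetical values rather than trained parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; assumes an odd kernel length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def spatial_attention(f_in, kernel):
    """Equation (13): one weight per joint."""
    C, T, V = len(f_in), len(f_in[0]), len(f_in[0][0])
    pooled = [sum(f_in[c][t][v] for c in range(C) for t in range(T)) / (C * T)
              for v in range(V)]
    return [sigmoid(x) for x in conv1d_same(pooled, kernel)]


def temporal_attention(f_in, kernel):
    """Equation (14): one weight per frame."""
    C, T, V = len(f_in), len(f_in[0]), len(f_in[0][0])
    pooled = [sum(f_in[c][t][v] for c in range(C) for v in range(V)) / (C * V)
              for t in range(T)]
    return [sigmoid(x) for x in conv1d_same(pooled, kernel)]


def channel_attention(f_in, W1, W2):
    """Equation (15): two fully connected layers, each followed by ReLU."""
    T, V = len(f_in[0]), len(f_in[0][0])
    pooled = [sum(sum(frame) for frame in ch) / (T * V) for ch in f_in]
    hidden = [relu(sum(w * p for w, p in zip(row, pooled))) for row in W1]
    return [relu(sum(w * h for w, h in zip(row, hidden))) for row in W2]


# One channel, two frames, three joints.
f = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]]
m_s = spatial_attention(f, [0.0, 1.0, 0.0])
m_t = temporal_attention(f, [0.0, 1.0, 0.0])
m_c = channel_attention(f, [[1.0], [-1.0]], [[2.0, 1.0]])
```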

Figure 7.

AA-OP human pose recognition model structure diagram.

As shown in Figure 7, the AA-OP human pose recognition model first preprocesses the aerobics video to extract static frame images from the motion video, which serve as the input for the DC-ULSAM estimation module. The DC-ULSAM estimation module extracts features from the image and generates the human skeleton map. Upon receiving the skeleton map, the AGCN-STC recognition module extracts features from the skeleton map through joint feature input flow, bone feature input flow, and motion time information flow. Attention mechanisms are used to enhance the focus on important features, and the final human pose recognition result is obtained by weighted fusion. The AA-OP human pose recognition model is based on OpenPose technology and has been optimized to enhance its feature extraction capabilities and fully leverage the advantages of different data streams.
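The final weighted fusion of the three streams can be sketched as below. The stream names, the class logits, and the fusion weights are hypothetical; the sketch only shows softmax scoring per stream followed by a weighted sum and an argmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits, weights):
    """Weighted sum of per-stream softmax scores.

    stream_logits: dict mapping stream name to a list of class logits
    weights:       dict mapping stream name to its fusion weight
    Returns (index of the best class, fused score per class).
    """
    n_classes = len(next(iter(stream_logits.values())))
    fused = [0.0] * n_classes
    for name, logits in stream_logits.items():
        probs = softmax(logits)
        for c in range(n_classes):
            fused[c] += weights[name] * probs[c]
    best = max(range(n_classes), key=lambda c: fused[c])
    return best, fused


# Hypothetical logits for three pose classes from the three streams.
logits = {"joint": [2.0, 0.0, 0.0], "bone": [0.0, 3.0, 0.0], "motion": [0.0, 2.0, 0.0]}
equal = {"joint": 1.0, "bone": 1.0, "motion": 1.0}
best_class, fused_scores = fuse_streams(logits, equal)
```

With equal weights the bone and motion streams outvote the joint stream in this toy input; raising the joint weight shifts the decision, which is the lever the weighted fusion provides.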

Results

Experimental setup

To validate the performance of the proposed AA-OP human pose recognition model for aerobics, the study conducted model performance verification experiments, ablation experiments, and comparison experiments. The dataset used in both the model performance verification and ablation experiments was the Human3.6M dataset, which includes images of 17 different scenarios, such as eating and exercising, providing a variety of human images with different poses and angles for model performance validation. The comparison experiment used a self-built dataset, AP. This dataset was captured using multiple cameras synchronously, featuring 20 professional aerobics athletes with different body types. The original video totaled 120 h. The AP dataset included 50 aerobics poses, such as rotation and jumping, and the scenes were shot in both gymnasiums and outdoor settings. The experimental environment is shown in Table 1.

Table 1.

Experimental environment parameters.

Configuration	Parameter
CPU	Intel Core i7-13700K
GPU	NVIDIA Titan X (Pascal) 12 GB × 2
OS	Ubuntu 16.04
Deep learning framework	PyTorch 1.5
Python environment	Python 3.7.7

Performance verification of AA-OP human pose recognition model

Based on the above experimental setup, the study first performed model performance validation on the AA-OP human pose recognition model using the training and test sets from the Human3.6M dataset as the target images. The evaluation metrics for model performance validation were accuracy and loss. The AA-OP model was tested for accuracy and loss on both the training and test sets, and the experimental results are shown in Figure 8.

Figure 8.

Accuracy and loss value experiment results.

As shown in Figure 8(a), with an increasing number of model iterations, the accuracy on both the test and training sets stabilized. After 150 iterations, the accuracy on the training set stabilized at 0.91, while the accuracy on the test set stabilized at 0.89, with the training set’s accuracy slightly higher than that of the test set. As shown in Figure 8(b), after 23 iterations, the loss value on the training set reached a steady state at 0.0072, and after 51 iterations, the loss value on the test set stabilized at 0.0089. These results demonstrate that the AA-OP model achieved good performance in terms of both accuracy and loss after training. To further verify the feasibility of the AA-OP model, the study conducted verification experiments on the light and moderately crowded image sets, with the results shown in Figure 9.

Figure 9.

Accuracy results for images with different levels of crowding.

In Figure 9(a), it can be seen that for lightly crowded images, the recognition difficulty was relatively low. The minimum accuracy on the test set was 0.75, and the maximum accuracy was 0.95. The minimum accuracy on the training set was 0.79, and the maximum accuracy was 0.98. In Figure 9(b), where the recognition targets were moderately crowded images, the accuracy on both the training and test sets decreased but remained above 0.7. The highest accuracy on the two sets reached 0.92 and 0.94. Overall, despite the increased difficulty of the recognition target, the AA-OP model still maintained high accuracy, demonstrating the robustness of the model’s performance.

Ablation experiment of AA-OP model

To assess the impact of the improvements made to the OpenPose algorithm, the study conducted additional ablation experiments to confirm the effectiveness of these modifications and the superior performance of the AA-OP model. The ablation experiments used the Human3.6M dataset, with evaluation metrics being keypoint scores and accuracy. First, the study conducted experiments on the keypoint scores for OpenPose, DC-ULSAM, AGCN-STC, and AA-OP models on images with different Signal-to-Noise Ratios (SNRs). The results of the keypoint score experiments are shown in Figure 10.
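The text does not state how the test images at a given SNR were produced. One common way, sketched below as an assumption, is to add zero-mean Gaussian noise whose power is the signal power divided by 10^(SNR/10). The function names and the flat pixel list are illustrative only.

```python
import math
import random


def noise_std_for_snr(pixels, snr_db):
    """Standard deviation of additive noise that yields the target SNR in dB."""
    signal_power = sum(p * p for p in pixels) / len(pixels)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return math.sqrt(noise_power)


def add_gaussian_noise(pixels, snr_db, rng=None):
    """Return a noisy copy of the pixel list at roughly the requested SNR."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    std = noise_std_for_snr(pixels, snr_db)
    return [p + rng.gauss(0.0, std) for p in pixels]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from a clean signal and its noisy version."""
    signal_power = sum(p * p for p in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# A flat hypothetical "image" degraded to about 10 dB.
clean = [10.0] * 5000
noisy = add_gaussian_noise(clean, snr_db=10.0, rng=random.Random(0))
achieved = measured_snr_db(clean, noisy)
```

At 30 dB the noise power is one thousandth of the signal power, while at 10 dB it is one tenth, which explains why keypoints become harder to locate in the second setting.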

Figure 10.

Keypoint score experiment results.

As shown in Figure 10(a), when the SNR of the recognition image was 30 dB, indicating good image quality, the highest keypoint score for OpenPose was 6.3, for DC-ULSAM it was 6.4, for AGCN-STC it was 7.8, and for AA-OP it was 9.0. In Figure 10(b), when the SNR dropped to 10 dB, indicating lower image clarity, the highest keypoint score for OpenPose was 5.1, for DC-ULSAM it was 5.8, for AGCN-STC it was 6.0, and for AA-OP it was 7.8. These results demonstrate that, at different SNRs, the AA-OP model outperforms the other three models in terms of keypoint recognition, validating the effectiveness of the improvements made to the OpenPose algorithm for enhancing human pose recognition. After comparing the keypoint scores, the study continued with angle and position error experiments on the four models. The experimental results are shown in Figure 11.

Figure 11.

Angle and position error experiment results.

Figures 11(a), (b), (c), and (d) show the angle and position errors for OpenPose, DC-ULSAM, AGCN-STC, and AA-OP models when recognizing target human bodies. It can be seen that the recognition error for DC-ULSAM and AGCN-STC models did not vary significantly. Specifically, the average angle error for AGCN-STC was 10.1%, with a position error of 9.5%. For DC-ULSAM, the average angle error was 9.8%, with a position error of 9.7%. The average angle error for AA-OP was 4.9%, and the position error was 5.4%. The error variation for OpenPose was larger, with an average angle error of 14.3% and a position error of 14.1%. In comparison, AA-OP demonstrated higher and more stable recognition accuracy, showing superior practicality. This experiment further confirmed the effectiveness of the improvements made in the research.

Validation of AA-OP model in practical applications

To further validate the performance of the AA-OP model in real-world application scenarios, the study conducted comparison experiments with mainstream models. The comparison models included Cascaded Pyramid Network (CPN), Convolutional Pose Machines (CPM), and High-Resolution Network (HRNet). The AP dataset was used, and the study first conducted a comparison of the recognition times for the four models. The results are shown in Figure 12.

Figure 12.

Running time results under different SNRs.

In Figure 12(a), when the image clarity was relatively high, with an SNR of 30 dB, the recognition time for all four models increased as the number of images grew. When the number of images reached 150, the recognition times for CPN, CPM, HRNet, and AA-OP were 20 ms, 18 ms, 14 ms, and 8 ms, respectively, with AA-OP performing the best. In Figure 12(b), when the image clarity decreased, all four models experienced a decrease in recognition efficiency. However, AA-OP maintained the highest efficiency, with the recognition times for CPN, CPM, and HRNet increasing by 5 ms, 8 ms, and 8 ms, while AA-OP’s recognition time only increased by 4 ms. The experimental data indicate that AA-OP maintained a fast recognition speed even with target images of varying clarity, validating its superior robustness and adaptability. The study continued by conducting comparison experiments on the recognition accuracy for single-person and multi-person images across the four models. The experimental results are given by Figure 13.

Figure 13.

Recognition accuracy results for single-person and multi-person images.

As seen in Figure 13(a), when the target was a single person, the recognition accuracy for all four models exceeded 85%. The highest accuracy for CPN was 96.3%, for CPM it was 93.2%, for HRNet it was 92.1%, and for AA-OP it was 96.7%, making AA-OP the best-performing model. However, when the target was a multi-person image, the accuracy of the three comparison models dropped significantly, with the highest accuracy for CPN, CPM, and HRNet being 94.1%, 91.3%, and 86.2%, respectively. In contrast, the AA-OP model maintained an accuracy above 93%, with a highest accuracy of 94.8%. This demonstrated that the AA-OP model could maintain high accuracy even for multi-person scenarios, indicating its suitability for real-world aerobics applications. Finally, the study conducted a comparison experiment on the AA-OP model’s performance in real human pose recognition. The visual results are shown in Figure 14.

Figure 14.

Human pose recognition results.

From Figures 14(a) and 14(b), it can be observed that CPM and CPN failed to recognize details such as the feet and occluded parts of the human body in the image. In Figure 14(c), HRNet misidentified a baseball bat as part of the human body and erroneously recognized a spectator in the background as the target person, without filtering out interference. In contrast, the AA-OP model provided a more complete recognition, capturing only the pose of the baseball player and excluding interference from the audience and sports equipment. AA-OP also outperformed the comparison models in handling fine details. This experiment demonstrated the superior recognition performance of the AA-OP model.

Discussion

The research developed the AA-OP human pose recognition model by optimizing the OpenPose algorithm to improve its applicability in aerobics. The experimental results show that the AA-OP model performs well in the performance validation and ablation experiments. After training, the accuracy reaches 95% and the loss value is 0.0089. Both its key point score and its angle and position error results are superior to those of the original OpenPose algorithm. The AA-OP model maintains a key point score of 7.5 on low-quality images, demonstrating good robustness and environmental adaptability, which is mainly attributed to the introduced attention mechanism. Shuai’s research also enhanced model adaptability by fusing image features through an attention mechanism.²⁹ In multi-person scenarios, the AA-OP model also demonstrated superior recognition performance, with the highest accuracy reaching 94.8%, compared with 96.7% for single-person images. The visual results in real human pose recognition likewise showed superior recognition performance. This might be related to the use of CNNs, which aligns with Purohit’s view on the role of neural networks in pedestrian action recognition.³⁰ Lauer also employed deep neural networks to develop a pose estimation toolbox, which achieved high precision.³¹ In the comparison experiment on running speed under different SNRs, the AA-OP model’s running time at 150 images was 8 ms at an SNR of 30 dB and increased by only 4 ms at lower image clarity, remaining faster than the other three comparison models. This was mainly due to the reasonable network structure of the AA-OP model, which is consistent with the research results of Samkari’s team on human pose recognition models.³² The results confirm that the AA-OP model can meet the requirements of human pose recognition for aerobics, providing more professional assistance to aerobics enthusiasts while also offering a new direction for the field of human pose recognition.

The AA-OP model proposed in the study combines a lightweight network structure with an adaptive attention mechanism, breaking through the performance bottleneck of the traditional OpenPose algorithm in complex scenarios while maintaining efficient recognition. In response to the professional needs of aerobics in practical applications, the AA-OP model achieved fine joint angle recognition through a hierarchical feature fusion strategy and accurately captured continuous movements in rhythmic actions. The research findings not only contribute to making mass sports and fitness more scientific and standardized, aligning with the United Nations Sustainable Development Goal (SDG) of “ensuring healthy lifestyles for the well-being of people of all ages” (SDG 3), but also reduce computing resource consumption through lightweight technology, promoting green technological innovation in line with “Taking Urgent Action to Address Climate Change and Its Impacts” (SDG 13).
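To make the notion of joint angle recognition concrete, the following is a minimal sketch of how a joint angle could be derived from three detected key points and compared with a reference angle. It assumes 2D key point coordinates and a relative error expressed in percent; both function names are hypothetical, and this is not necessarily the error definition used in the study.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c (2D points)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        raise ValueError("coincident key points: angle is undefined")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def relative_angle_error(predicted_deg, reference_deg):
    """Relative angle error in percent (assumed definition; reference must be nonzero)."""
    return 100.0 * abs(predicted_deg - reference_deg) / abs(reference_deg)


# Illustrative key points (not data from the study): shoulder, elbow, wrist.
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
angle = joint_angle(shoulder, elbow, wrist)
print(f"elbow angle: {angle:.1f} degrees")
```

In practice, the key point coordinates would come from the pose estimator's output for each frame, and the reference angle from an annotated standard movement.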

Conclusion

Aerobics has become increasingly popular among the general public due to its many advantages and is now a mainstream fitness activity. Recognizing the movement poses in aerobics can help fitness enthusiasts perform their exercises more accurately. However, traditional human pose recognition models, limited by the quality of the target images, cannot meet the needs of human pose recognition in multi-person aerobics scenarios. Therefore, the study proposed an aerobics human pose recognition model based on OpenPose technology. The final experimental results showed that the model outperformed the baseline algorithms in recognition performance. Additionally, the model demonstrated excellent recognition accuracy and robustness across different recognition scenarios. The results indicate that the AA-OP aerobics pose recognition model proposed in the study can meet the requirements for human pose recognition in actual aerobics exercises. The study has two main limitations. First, the operational efficiency of the model is closely related to the hardware configuration, but its performance has not yet been systematically tested in different hardware environments, which may affect its applicability on low-configuration devices. Second, although the model performs well in experimental scenarios, its recognition accuracy may decline in complex situations such as extreme lighting conditions, dense crowds, or high-speed movement. Future research should test the model on different hardware platforms, improve its operational efficiency, and explore further lightweight improvements. In addition, integrating multimodal data or time series modeling methods could enhance the stability of the model in complex motion scenarios.

Footnotes

ORCID iD

Qingbao Wang

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Schumann, Feuerbacher, Sünkeler, et al. Compatibility of concurrent aerobic and strength training for skeletal muscle size and function: an updated systematic review and meta-analysis. Sports Med 2022; 52(3): 601–612.

2. Vikalwe Shakrani, Mathew Kanyangarara, Parowa, et al. A deep learning model for face recognition in presence of mask. Acta inform Malays 2022; 6(2): 43–46.

3. Masagca, RCE. The effect of 10-week wholebody calisthenics training program on the muscular endurance of untrained collegiate students. J Hum Sport Exerc 2024; 19(14): 941–953.

4. Panihar, Rani. The effect of calisthenics training on physical fitness parameters and sports specific skills of soccer players: a randomized controlled trial. Areh 2022; 36(2): 23–31.

5. Lan, et al. Vision-based human pose estimation via deep learning: a survey. IEEE Trans Hum Mach Syst 2023; 53(1): 253–268.

6. Wang, Jia, Bai, et al. Research on the location of railway train in tunnel based on factor graph optimization. Appl Comput Lett 2023; 7(1): 5–10.

7. Guo, H-C. Robust abnormal human-posture recognition using OpenPose and multiview cross-information. IEEE Sens J 2023; 23(11): 12370–12379.

8. Juang, C-F, W-E. Human posture classification using interpretable 3-D fuzzy body voxel features and hierarchical fuzzy classifiers. IEEE Trans Fuzzy Syst 2022; 30(12): 5405–5418.

9. Liu, Pan. An intelligent playback control system adapted by body movements and facial expressions recognized by OpenPose and CNN. Multimed Tools Appl 2024; 83(10): 31139–31160.

10. Osawa, You, Sun, et al. Telerehabilitation system based on OpenPose and 3D reconstruction with monocular camera. Int J Robot Mechatron 2023; 35(3): 586–600.

11. Hsiao, Liu, Lin. Markerless motion evaluation via OpenPose and fuzzy activity evaluator. Int J Chin Inst Eng 2022; 45(8): 697–705.

12. Jeong. OpenPose based smoking gesture recognition system using artificial neural network. Teh glas 2023; 17(2): 251–259.

13. Walocha, Drewitz, Ihme. Activity and stress estimation based on OpenPose and electrocardiogram for user-focused Level-4-Vehicles. IEEE Trans Hum Mach Syst 2022; 52(4): 538–546.

14. Dubey, Dixit. A comprehensive survey on human pose estimation approaches. Multimed Syst 2022; 29(1): 167–195.

15. Elugbadebo, Orunsolu, Akinyele, et al. An efficient and secured graphical authentication system. Acta inform Malays 2022; 6(1): 17–21.

16. Fang, H-S, Zhu, et al. AlphaPose: whole-body regional multi-person pose estimation and tracking in real-time. Int IEEE T Pattern Anal 2023; 45(6): 7157–7173.

17. Mubarak Suud. An image processing approach for monitoring soil plowing based on drone rgb images. Big data Agr 2022; 5(1): 01–05.

18. Hsieh, Y-Z, Meng, Y-H. A video surveillance system for determining the sexual maturity of cobia. Int IEEE T Consum Electr 2024; 70(1): 484–495.

19. Giri, Prasad Chimouriya, Ram Ghimire. Crossing strokes examination from cromaticity diagram. Sci Herit J 2023; 7(1): 01–08.

20. Zeng, et al. Multi-person pose estimation based on graph grouping optimization. Multimed Tools Appl 2023; 82(5): 7039–7053.

21. Xiang, Pan, et al. FFTCA: a feature fusion mechanism based on fast fourier transform for rapid classification of apple damage and real-time sorting by robots. Food Bioproc Tech 2025; 18(2): 1631–1655.

22. Guo, Liu, et al. Attention mechanisms in computer vision: a survey. Comput Vis Media (Beijing) 2022; 8(3): 331–368.

23. Zulqarnain, Ghazali, Aamir, et al. An efficient two-state GRU based on feature attention mechanism for sentiment analysis. Multimed Tools Appl 2024; 83(1): 3085–3110.

24. Preethi, Mamatha. Region-based convolutional neural network for segmenting text in epigraphical images. Artif Intell Appl 2023; 1(2): 119–127.

25. Periakaruppan, Shanmugapriya, Sivan. Self-attention generative adversarial capsule network optimized with atomic orbital search algorithm based sentiment analysis for online product recommendation. Int J Intell Fuzzy Syst 2023; 44(6): 9347–9362.

26. Garg, Saxena, Gupta. Yoga pose classification: a CNN and MediaPipe inspired deep learning approach for real-world application. Int J Amb Intel Hum Comp 2023; 14(12): 16551–16562.

27. Ren, Wang, et al. Automated matching of ancient bone stick fragments: integrating Siamese networks with sequence similarity metrics for multi-feature fusion. IEEE Access 2025; 13(1): 49527–49542.

28. Wang, et al. Efficient channel-temporal attention for boosting RF fingerprinting. IEEE Open J Signal Process 2024; 5(1): 478–492.

29. Shuai, Liu. Adaptive multi-view and temporal fusing transformer for 3D human pose estimation. Int IEEE T Pattern Anal 2023; 45(4): 4122–4135.

30. Purohit, Dave. Leveraging deep learning techniques to obtain efficacious segmentation results. Int AAES 2023; 1(1): 11–26.

31. Lauer, Zhou, et al. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat Methods 2022; 19(4): 496–504.

32. Gupta, K Pathak, et al. Human activity recognition in artificial intelligence framework: a narrative review. Artif Intell Rev 2022; 55(6): 4755–4808.