Deep learning-based badminton action recognition and quality assessment

Abstract

Most people primarily rely on subjective opinions and action assessment from others in the process of learning badminton, which leads to biased and unreliable assessment of player performance. This paper presents a deep learning-based method for the recognition and quality assessment of badminton actions to accurately identify player actions and assess player performance. In this research, we construct a video dataset of standard badminton actions for training networks and design a human pose estimation and tracking network to detect keypoints of individual players in the video dataset and track their trajectories. Furthermore, badminton action recognition is carried out based on the SlowFast network framework, and a Siamese network with SlowFast as the backbone network is proposed for automating the quality assessment of badminton actions. Experimental results demonstrate that the mean average precision (mAP) of human body pose estimation reaches 83.2%, the Multiple Object Tracking Accuracy (MOTA) of pose detection and recognition in badminton games reaches 81.4%, and the Multiple Object Tracking Precision (MOTP) reaches 90.7%. The accuracy of recognizing the badminton strokes of professional players is 83.08% for Top-1 and 96.89% for Top-3. These results indicate that the proposed method can be effectively applied to badminton action recognition and quality assessment.

Keywords

Deep learning, badminton action recognition, action quality assessment

1. Introduction

Badminton is a prevalent racquet sport, with a significant number of people getting involved every year. However, during the learning process, placing excessive reliance on subjective opinions provided by coaches can lead to biases and unreliable assessment of player performance. This results in significant variations in the basic badminton actions among different players. In reality, during the early stages of learning badminton, the correctness and standardization of movements are crucial, as they will determine the potential for future improvements in technical skills. Therefore, there is a need for an objective and reliable method to assess badminton actions in order to enhance training and teaching effectiveness.

Deep learning has increasingly been applied in sports science, offering promising solutions for action recognition and quality assessment.^1–3 Due to its capacity to identify patterns within data using multi-layered neural networks, deep learning is particularly effective at evaluating the actions of badminton players. This allows for objective and quantifiable assessments of performance, which can be used to provide personalized feedback and improve player development. By enabling access to more accurate and detailed information, deep learning has the potential to revolutionize the field of badminton training and teaching.

However, deep learning-based action recognition for badminton faces the challenge of insufficient annotated data, which is crucial for training models and ensuring accuracy and generalization ability. Badminton action data primarily comprise data derived from wearable sensors and video-based data. While wearable sensor data holds potential in badminton action recognition,^4,5 it also comes with certain limitations. Wearable sensors have limited perspectives and information dimensions, which can result in partial information loss or difficulty capturing specific actions. Moreover, sensor data can be susceptible to noise and drift, leading to inaccurate action recognition outcomes. More importantly, acquiring wearable sensor data requires players to wear sensors, which can affect their comfort during play. Video-based badminton action data largely avoid these issues.

Another key challenge is the absence of a suitable network model for autonomously identifying and quantitatively assessing badminton actions. Badminton actions come in various forms, encompassing different strokes, footwork, and postures. Developing a matching deep learning model capable of encompassing and accurately recognizing all these actions is complex.

To address the above challenges, our work is dedicated to exploring and developing novel deep learning models and methods for recognizing and evaluating badminton actions. In this work, we first set up a video collection environment to collect standard badminton action videos of players, and standardized action cutting and labeling are performed on the collected video data to build a preliminary dataset of standard badminton action videos. Next, we design a deep network model based on Convolutional Neural Networks (CNN) for human keypoint detection of players and design a human pose tracking network based on the Spatial Temporal Graph Convolutional Network (ST-GCN)⁶ to match the motion trajectory of a single individual for better recognition of specific badminton actions. To further reduce the inaccurate action recognition caused by background interference, we use the SlowFast⁷ network structure, which analyzes the static information in the video using a slow high-resolution CNN channel (Slow pathway) and analyzes the dynamic information in the video using a fast low-resolution CNN channel (Fast pathway). Finally, in the quality assessment stage of badminton actions, we design a Siamese network with the SlowFast network as the backbone network to compare the input actions with those of professional badminton players, and thus obtain reasonable scores. The actions of professional players have been validated through extensive training and competition, demonstrating a high level of technical proficiency and stability, which provides a reliable benchmark for our assessment. The deep learning framework we have developed not only helps coaches to assess the badminton actions of players, but also allows players to make pose corrections according to the changes of keypoints.

Our main contributions are summarized as follows:

We preliminarily construct a video dataset of standard badminton actions and use it to train and validate our badminton action recognition and quality assessment network.

We have constructed an end-to-end quality assessment network framework of badminton actions based on deep learning. Specifically, we design a human pose estimation and tracking network to detect the keypoints of individual players in the video dataset, and use the analysis of keypoint differences to guide players to make correct badminton actions.

We use the SlowFast network framework for badminton action recognition and propose a Siamese network with it as the backbone network. We use this Siamese network to compare the differences between actions of players and standard actions, thereby achieving automated quality assessment of badminton actions.

Experimental results show that our work is reasonable and can be applied effectively in deep learning-based badminton action recognition and quality assessment.

2. Related works

2.1 Sports dataset construction

Currently, the relevant datasets for sports scenes are not yet complete. To use deep learning networks for the recognition, detection, and other related tasks of specific sports actions, it is necessary to construct further sport-specific datasets. Parmar et al.⁸ created a multitask dataset including seven tasks like diving and skiing. They used the C3D network as the backbone network to predict scores for the seven sports. Shao et al.⁹ proposed a large-scale, high-quality, hierarchically annotated dataset for fine-grained gymnastic actions named FineGym. They also analyzed existing action recognition methods from multiple levels and perspectives. Xu et al.¹⁰ constructed a fine-grained competitive sports video dataset, FineDiving, and designed a more reliable and transparent scoring method. FineDiving focuses on various diving events. The dataset constructed by Ban et al.,¹¹ BadmintonDB, includes multi-stroke rallies, stroke actions, and outcome annotations from nine real badminton matches between top players, which can be used for predicting professional players’ performance. However, since it does not contain action samples from non-professional players, BadmintonDB has certain limitations in terms of action recognition and quality assessment. The dataset proposed by Li et al.¹² focuses on badminton videos, providing diverse, action-centric data. In our work, we focus on badminton action recognition and quality assessment and construct a video dataset of standard badminton actions.

2.2 Human pose estimation and tracking

According to the different processes of image processing, human pose estimation is generally divided into top-down methods^13–17 and bottom-up methods.^18–22 Sun et al.¹³ proposed the HRNet, a high-resolution network that achieved a reliable high-resolution representation through parallel multi-resolution subnets and repeated multi-scale fusion. Jingdong Digital Technology team¹⁵ designed a Siamese Graph Convolution Network for human pose tracking, which effectively captured the similarity of human poses based on skeletal representation. Snower et al.¹⁶ designed a simple and efficient tracking network, KeyTrack, based on Transformer,²³ which used only the detected keypoint information in the tracking phase. Raaj et al.²¹ proposed a spatiotemporal affinity field (STAF) structure that can encode across video sequences, as well as a novel cross-limb temporal topological graph.

2.3 Action recognition and quality assessment

Action recognition and quality assessment are essential and challenging tasks in computer vision. Deep learning-based approaches have achieved state-of-the-art results in Action Recognition^24–31 and Quality Assessment.^32–34 Pallabi Ghosh et al.²⁴ proposed a spatial-temporal graph convolutional network (GCN) that learns to capture the long-term dependencies between body parts for action recognition. Reference²⁶ proposed PoseConv3D, a novel skeleton-based action recognition method. PoseConv3D is more effective in learning spatiotemporal features, more robust to pose estimation noise, and demonstrates better generalization in cross-dataset settings. Rahmad et al.²⁸ developed an automatic badminton action recognition model based on a deep learning pretrained AlexNet convolutional neural network (CNN) for feature extraction, with features classified using an SVM. However, this work only supports the recognition of two action classes. Gao et al.³² modeled the asymmetric interactions among agents for action quality assessment. They proposed an asymmetric interaction module (AIM) to explicitly model asymmetric interactions between intelligent agents within an action, where they grouped these agents into primary and secondary ones. Bai et al.³⁴ proposed a temporal parsing transformer to decompose the holistic feature into temporal part-level representations. They utilized a set of learnable queries to represent the atomic temporal patterns for a specific action, and adopted the state-of-the-art contrastive regression based on the part representations.

3. Our approach

Our research utilizes deep learning techniques in computer vision to analyze badminton technical actions. It involves constructing a standard technical action video dataset for badminton, designing a baseline for human pose estimation and tracking, action recognition and quality assessment. The dataset is constructed by collecting and annotating videos of technical standard actions. Subsequently, a series of deep neural networks are trained for human pose estimation, tracking, action recognition, and action quality assessment. Our work represents a significant advance in the automated recognition and quality assessment of badminton actions, demonstrating the potential of deep learning techniques in enhancing badminton education.

3.1 Badminton action video dataset construction

We construct a video capture environment to shoot videos of professional players and non-professional badminton enthusiasts. The advantages of our dataset over existing datasets are reflected in the multi-angle video capture, standardized action annotation, the inclusion of negative samples, and the classification of actions specific to badminton. We use a multi-angle method for professional players, capturing videos from the front, back, left, and right perspectives. These videos, along with those of famous badminton players from major international competitions, are used to construct the badminton technical standard action video dataset. Non-professional badminton enthusiasts use phones or cameras to capture videos with a single fixed viewpoint, which are only employed as negative samples in training the action scoring network model.

We adopt VoTT software³⁵ for both image-level and video-level annotation in the manual data labeling process. For image-level annotation, the position of each professional player in the multi-angle videos is labeled using VoTT software, along with ground truth labels for the coordinates of each keypoint for pose estimation. The image-level annotation adheres to the same standards as the COCO dataset.³⁶ Video-level annotation involves labeling successive frames in multi-angle videos of professional players using VoTT software. This process enables comprehensive annotation of technical actions. The video-level annotation categorizes actions into six categories:²⁷ serve a ball, hit a high ball, smash a ball, lob a ball, forehand pick a ball, and backhand pick a ball. Table 1 and Table 2 present the specifications of the video-level standard dataset. We employ a stratified sampling method to ensure that the training and validation sets have a similar distribution of stroke categories. Specifically, we randomly select samples within each stroke category and divide them according to an 80:20 ratio, which prevents any category from being overly scarce in either the training or validation set.

Table 1.
Statistics on the number of stroke categories in the badminton video dataset.

Action category	Serve a ball	Hit a high ball	Smash a ball	Lob a ball	Forehand pick a ball	Backhand pick a ball	Total
Number	41	55	51	65	71	74	357

Table 2.

Badminton video dataset video clip statistics.

Train dataset	Valid dataset	Maximum duration	Minimum duration	FPS	Capacity
284	73	4 s	2 s	60	4 G
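The stratified 80:20 split described in Section 3.1 can be illustrated with a short sketch. It is not the authors' code: the function name `stratified_split`, the short category names, the random seed, and the per-category rounding rule are assumptions made for illustration, so the resulting set sizes need not match Table 2 exactly. Only the per-category clip counts are taken from Table 1.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.8, seed=0):
    """Split sample indices category by category (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Rounding per category is an assumption; the paper does not state it.
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)


# Clip counts per stroke category, taken from Table 1.
counts = {"serve": 41, "high": 55, "smash": 51, "lob": 65,
          "forehand_pick": 71, "backhand_pick": 74}
labels = [name for name, n in counts.items() for _ in range(n)]
train_idx, valid_idx = stratified_split(labels)
```

Splitting inside each category, rather than over the pooled clips, is what keeps every stroke type represented in both sets.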

3.2 Human pose estimation network based on deep learning

Based on the human skeleton keypoint sequence, we perform pose estimation on the labeled data captured from the movements of professional players in front, back, left, and right views. The human skeleton keypoint sequence adopts keypoints consistent with the COCO dataset, with the skeleton diagram of these keypoints presented in Figure 1.

Figure 1.

Keypoints of COCO dataset.

We adopt a top-down structure for the deep learning-based human keypoint pose estimation network utilized in this study.³⁷ In this paper, we introduce the use of the YOLO-v3 network³⁸ for human detection, which provides the human bounding box required for pose estimation. The image information within the bounding box is then upsampled or downsampled using functions from the OpenCV library to unify all images into human bounding box images of the same resolution size. This unified human bounding box is used as the input for the pose estimation framework.

To implement human pose estimation, we utilize the HRNet.¹⁸ First, the model pre-trained on the COCO dataset is fine-tuned on the badminton technical action video dataset constructed in this study to obtain the keypoint poses of badminton players. This approach helps prevent overfitting because the pre-trained model has already learned more generalized features, improving its performance on smaller datasets. The final fully convolutional output layer produces 17 keypoint heatmaps. However, because the heatmap resolution produced by HRNet is only 1/4 of the original image resolution, the heatmap must be upsampled to obtain a higher-resolution heatmap. The coordinates of each keypoint are then obtained by taking the local maximum of the high-resolution heatmap. Figure 2 illustrates the structure of the HRNet, where the feature map is formed by stacking multiple frames of input images and is upsampled and downsampled through vanilla convolution operations to obtain the final output feature map.
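The heatmap decoding step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: nearest-neighbour upsampling stands in for whichever interpolation was actually used, the global maximum is taken because a single player crop is assumed, and the names `upsample_nearest` and `decode_keypoint` are hypothetical.

```python
def upsample_nearest(heatmap, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in heatmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def decode_keypoint(heatmap, factor=4):
    """Return (x, y, confidence) of the peak in the upsampled heatmap."""
    up = upsample_nearest(heatmap, factor)
    # max() keeps the first maximal element, i.e. the top-left cell of the peak.
    conf, x, y = max(
        ((v, x, y) for y, row in enumerate(up) for x, v in enumerate(row)),
        key=lambda item: item[0],
    )
    return x, y, conf


# A toy 3x3 heatmap for one keypoint; the peak is at column 2, row 1.
toy_heatmap = [[0.0, 0.0, 0.0],
               [0.0, 0.0, 0.9],
               [0.0, 0.1, 0.0]]
x, y, conf = decode_keypoint(toy_heatmap)
```

With a factor of 4, the low-resolution peak at column 2, row 1 maps to pixel (8, 4) in the upsampled heatmap. Running the decoder once per heatmap yields the 17 keypoint coordinates.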

Figure 2.

The structure of high-resolution network.

3.3 Human pose tracking network based on deep learning

Considering spatial and pose-related dependencies, we propose a human pose-tracking network based on the Spatial-Temporal Graph Convolutional Network (ST-GCN).⁶ The structure of the ST-GCN is shown in Figure 3.

Figure 3.

Spatial-temporal graph convolutional network.

The proposed spatial-temporal convolutional network architecture is built upon the skeletal graph, where each node represents a human keypoint. As illustrated in Figure 3, the input skeletal feature data, which consists of the keypoint coordinate vector on each graph node, is first batch-normalized. The feature data is then sequentially fed into nine ST-GCN blocks, each comprising an attention module (ATT), a graph convolutional module (GCN), and a temporal convolutional module. Finally, a pooling layer is utilized to extract high-level features of 256 dimensions, followed by a fully connected layer for classification and label output. The nine ST-GCN blocks have identical structures, and the outputs of the first three, middle three, and last three blocks have 64, 128, and 256 channels, respectively. Additionally, after each ST-GCN block, dropout with a probability of 0.5 is applied to prevent overfitting.³⁹
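The channel layout of the nine blocks can be written down as a small configuration sketch. The helper name `stgcn_block_channels` and the input channel count of 2 (an x and a y coordinate per keypoint) are assumptions for illustration; the output widths and the dropout probability follow the description above.

```python
DROPOUT_P = 0.5  # dropout probability applied after each block, as described


def stgcn_block_channels(in_channels=2):
    """List (in, out) channel pairs for the nine ST-GCN blocks.

    in_channels=2 assumes 2D keypoint coordinates; this is an assumption.
    """
    outs = [64] * 3 + [128] * 3 + [256] * 3
    pairs = []
    prev = in_channels
    for out in outs:
        pairs.append((prev, out))
        prev = out
    return pairs


block_pairs = stgcn_block_channels()
```

The last pair ends at 256 channels, which matches the 256-dimensional feature extracted by the pooling layer.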

A matching model is utilized for person re-identification following a spatial-temporal graph convolutional network. The matching model extracts features related to human pose, giving the results greater interpretability. Furthermore, the strong relationship between bounding boxes directly constrains them. Improved tracking performance is achieved by using the keypoints of the person to obtain the region of interest (ROI) while ensuring the distinction between candidate regions and employing pose features for skeleton-based pose matching.

3.4 Badminton action recognition network based on deep learning

This work collects video scenes with two distinct parts: static regions with little or slow changes and dynamic regions with continuous changes. Despite locating specific individuals using tracking models, irrelevant background regions still exist. To address this issue, the SlowFast⁷ network is adopted for badminton action recognition.

The SlowFast network utilizes a slow high-resolution convolutional neural network channel (Slow pathway) to analyze the static content of the video while simultaneously utilizing a fast low-resolution convolutional neural network channel (Fast pathway) to analyze the dynamic content. The network structure is depicted in Figure 4.

Figure 4.

The structure of SlowFast network.

Both the Slow pathway and the Fast pathway use the 3D ResNet⁴⁰ model to extract features from the subsampled input video frames. The Slow pathway uses a large temporal stride (the number of frames between two consecutive sampled frames), usually set to τ=16, which means that one frame out of every 16 frames of the original video is sampled and fed into the model.⁷ The Fast pathway uses a much smaller temporal stride of τ/α, where α is set to 8, meaning that one frame out of every two frames of the original video is sampled.⁷ To remain lightweight, the Fast pathway uses a significantly smaller convolutional width (number of filters) than the Slow pathway, usually set to β = 1/8 of the convolutional width of the Slow pathway.⁷ The smaller convolutional width keeps the computation of the Fast pathway well below that of the Slow pathway, even though it processes more frames. As shown in Figure 4, where T represents the number of frames and C represents the feature dimension, the Fast pathway has eight times as many frames as the Slow pathway, but its feature dimension is reduced to 1/8. Consequently, the Fast pathway learns more temporal information from the video frames, while the Slow pathway mainly learns spatial information.
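The two sampling rates can be made concrete with a short sketch. The 64-frame clip length, the 64-channel Slow width, and the helper names are illustrative assumptions; τ=16, α=8, and β=1/8 are the values stated above.

```python
def slowfast_indices(num_frames, tau=16, alpha=8):
    """Return the frame indices sampled by the Slow and Fast pathways."""
    fast_stride = tau // alpha          # 16 / 8 = 2 frames
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


def fast_width(slow_channels, beta=1 / 8):
    """Channel width of the Fast pathway for a given Slow pathway width."""
    return int(slow_channels * beta)


# Hypothetical 64-frame clip (about one second at the dataset's 60 FPS).
slow_idx, fast_idx = slowfast_indices(64)
fast_channels = fast_width(64)
```

For this clip the Slow pathway sees frames 0, 16, 32, and 48, while the Fast pathway sees 32 frames, eight times as many, with one eighth of the channels.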

3.5 Badminton action quality assessment network

For the badminton action quality assessment network, we regard professional players as positive samples, with their scores set as the maximum value achievable. Data from non-professional players is regarded as negative samples. For each action category, both positive and negative samples are fed into a Siamese network, whose backbone is the aforementioned trained badminton action recognition network. Through training with a contrastive loss function, the feature representations of negative samples closer to the standard form of professional players are pulled towards higher scores. In contrast, samples further from the standard are pulled towards lower scores. The scoring is based on the feature distance between the sample and the positive sample, with the mapping or scaling of this distance ensuring that scores fall within a reasonable range. Hence, a new sample of badminton action can be scored by comparing it with the standard form of professional players. Figure 5 shows the architecture of the action quality assessment network.
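A minimal sketch of the two ingredients named above, a contrastive loss on a pair of embeddings and a distance-to-score mapping, is given below. The paper does not specify either formula, so both are assumptions: the loss is one commonly used margin-based contrastive form, and the exponential mapping to a 0–100 score, the `margin` and `scale` parameters, and all function names are hypothetical. The embeddings would come from the shared SlowFast backbone, which is replaced here by toy vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same, margin=1.0):
    """Margin-based contrastive loss for one pair (assumed form).

    Similar pairs are pulled together; dissimilar pairs are pushed
    apart until their distance exceeds the margin.
    """
    d = euclidean(a, b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


def quality_score(sample, reference, scale=1.0):
    """Map feature distance to a 0-100 score (hypothetical mapping)."""
    return 100.0 * math.exp(-euclidean(sample, reference) / scale)


reference = [0.0, 0.0]   # toy embedding of a professional (positive) action
close = [0.1, 0.0]       # toy embedding near the standard form
far = [3.0, 4.0]         # toy embedding far from the standard form
```

Any monotonically decreasing mapping from distance to score would serve the same purpose; the exponential is chosen here only because it is bounded and gives the full score at zero distance.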

Figure 5.

The structure of badminton action quality assessment network.

4. Experiments

4.1 Evaluation metrics

Experiments utilize Mean Average Precision (mAP)¹³ to measure the accuracy of multi-person pose estimation within frames, and multiple object tracking (MOT) metrics¹⁵ are employed to evaluate pose tracking for all human joints independently. The primary metrics used to evaluate the performance of the human pose tracking network include Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP).

The mAP is obtained by averaging all AP values. To calculate the AP, assume that there are M positive samples among N samples. M recall values can then be obtained, where recall is the proportion of positive samples that the model correctly identifies. For each recall value $r$, the maximum precision over all recall levels $r^{'} \geq r$ is taken, where precision is the proportion of correctly predicted positive results among all results predicted as positive. The final AP value is the average of these M precision values. The AP value can also be computed as the area under the precision-recall curve with the following formula:

$$ AP = \int_{0}^{1} P(r) \, dr \tag{1} $$
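The discrete AP computation described before Equation (1) can be sketched as follows. The function name and the toy scores are illustrative assumptions; the procedure (one precision value per positive sample, each replaced by the maximum precision at any equal or higher recall, then averaged) follows the text.

```python
def average_precision(scores, labels):
    """Interpolated AP over M positive samples among N ranked samples."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_pos = sum(label for _, label in ranked)
    if total_pos == 0:
        return 0.0

    # Precision measured each time a positive sample is retrieved.
    precisions = []
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            precisions.append(true_pos / rank)

    # For each recall level r, keep the maximum precision at any r' >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    return sum(precisions) / total_pos


# Toy example: N = 4 samples, M = 3 positives (label 1).
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1])
```

In the toy example the raw precisions at the three positives are 1, 2/3, and 3/4; interpolation raises the second to 3/4, giving an AP of 2.5/3. The mAP is then the mean of such AP values over the evaluated keypoint groups.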

MOTA evaluates the tracking accuracy, which measures the ability to detect and maintain object trajectories. The calculation for MOTA is as follows:

$$ MOTA = 1 - \frac{\sum_{t} (m_{t} + fp_{t} + mme_{t})}{\sum_{t} g_{t}} \tag{2} $$

In (2), $m_{t}$ represents the number of missed targets in detection, $fp_{t}$ represents the number of false positive targets in detection, $mme_{t}$ represents the number of incorrect matches that occur during tracking, and $g_{t}$ represents the number of ground-truth targets, each counted at frame $t$. MOTP evaluates the tracking precision and primarily reflects the localization performance of the detector. The calculation for MOTP is as follows:

$$ MOTP = \frac{\sum_{i, t} d_{t}^{i}}{\sum_{t} c_{t}} \tag{3} $$

In (3), $d_{t}^{i}$ denotes the distance between the predicted location and the actual location, while $c_{t}$ represents the number of successful matches between the predicted location and the ground-truth location.
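Equations (2) and (3) translate directly into code. The sketch below assumes that the per-frame counts and per-match values have already been produced by a matching step, which is not shown; the function names and the toy numbers are illustrative only.

```python
def mota(misses, false_positives, mismatches, ground_truths):
    """MOTA from per-frame error counts, following Equation (2)."""
    errors = sum(misses) + sum(false_positives) + sum(mismatches)
    return 1.0 - errors / sum(ground_truths)


def motp(match_values_per_frame, matches_per_frame):
    """MOTP from per-match values d and match counts c, following Equation (3)."""
    total = sum(sum(frame) for frame in match_values_per_frame)
    return total / sum(matches_per_frame)


# Toy sequence of three frames with ten ground-truth targets each.
mota_value = mota(misses=[1, 0, 0],
                  false_positives=[0, 1, 0],
                  mismatches=[0, 0, 1],
                  ground_truths=[10, 10, 10])

# Two matches in the first frame and one in the second.
motp_value = motp([[0.1, 0.2], [0.3]], [2, 1])
```

Three errors over thirty ground-truth targets give a MOTA of 0.9, and the three match values average to 0.2.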

4.2 Quantitative experiments

Our approach fine-tunes the HRNet, pre-trained on the COCO dataset, on the badminton technical standard action video dataset. The estimation accuracy of different human keypoints is shown in Table 3. The mean average precision of pose estimation reaches 83.2%, the highest among the compared methods, indicating that our model can accurately estimate the pose keypoints of badminton players in the video dataset. Accurate keypoints in turn benefit the subsequent keypoint-based pose tracking.

Table 3.
Human pose estimation results for badminton motion.

Methods	Head(%)	Shou(%)	Elb(%)	Wri(%)	Hip(%)	Knee(%)	Ankl(%)	mAP(%)
Ning et al.¹⁵	72.4	84.1	83.1	79.4	81.3	81.2	75.3	79.5
Xiao et al.¹⁷	75.2	87.5	87.5	82.4	83.4	83.3	78.4	82.5
Raaj et al.²¹	68.4	80.5	79.6	74.9	76.8	76.1	71.6	75.4
Ours	75.7	89.2	88.1	83.0	85.0	84.9	79.9	83.2

* The pose estimation accuracy of each part is listed in the table, and the final mean average accuracy is calculated.

According to the training results of our network on the dataset in Table 4, the MOTA of our model is reported to be 81.4%, with a MOTP of up to 90.7%. These indicate that our network can effectively and consistently track the position of each target during the hitting process of badminton players.

Table 4.

Human pose tracking and human action recognition results for badminton motion.

Methods	MOTA(%)	MOTP(%)	TOP-1(%)	TOP-3(%)
PoseC3D²⁶	79.4	89.2	81.8	94.5
TimeSformer³⁰	72.0	88.2	73.5	94.2
Swin³¹	80.6	89.8	82.3	95.8
Ours	81.4	90.7	83.1	96.9

* Top-1 accuracy takes the class with the largest probability in the final probability vector as the predicted result; the prediction is correct if this class matches the true label and wrong otherwise. Top-3 accuracy takes the three classes with the largest probabilities in the final probability vector; the prediction is correct if these three classes contain the true label and wrong otherwise.
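The Top-1 and Top-3 definitions in the table note can be sketched in a few lines. The function name and the toy probability vectors are illustrative assumptions; in practice each vector would have one entry per stroke category.

```python
def top_k_accuracy(prob_vectors, true_labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for probs, truth in zip(prob_vectors, true_labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy predictions over four classes for three samples.
probs = [[0.60, 0.30, 0.10, 0.00],   # true class 0: ranked first
         [0.15, 0.50, 0.25, 0.10],   # true class 2: ranked second
         [0.10, 0.20, 0.30, 0.40]]   # true class 0: ranked last
truth = [0, 2, 0]

top1 = top_k_accuracy(probs, truth, k=1)
top3 = top_k_accuracy(probs, truth, k=3)
```

Only the first sample counts as a Top-1 hit, while the first two count as Top-3 hits, so Top-3 accuracy is never lower than Top-1 accuracy.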

In the action recognition task, we focus on detecting and tracking the badminton strokes of players. Through parameter settings and training with appropriate data, our model is designed to detect and track the actions of the target-hitting player, achieving our expected effect. Building on this foundation, our work aims to recognize the strokes of the detected target-hitting player. We utilize both positive and negative samples to accomplish this aim. When assessing the quality of actions, we set the actions of professional athletes as the full score standard, and other samples are scored based on their similarity to professional actions. We input these samples into the Siamese network for badminton action recognition and quality assessment.

Table 4 shows the results for action recognition. The Top-1 accuracy is 83.08%, meaning that the label with the highest predicted probability matches the true label for 83.08% of the samples. Similarly, the Top-3 accuracy is 96.89%, meaning that the true label is among the three most probable labels for 96.89% of the samples. These results demonstrate the effectiveness of our approach, which achieves high recognition accuracy for the badminton actions of players.

Figure 6 illustrates the confusion matrix for the badminton action recognition model. Each cell displays the model's recognition performance for different action categories in decimal format. The values on the diagonal represent the model's correct recognition rates for each category, with the ‘Smash a ball’ category achieving the highest recognition rate of 85.0%. In comparison, the ‘Hit a high ball’ category has the lowest recognition rate at 80.0%. As shown in Figure 6, the model performs well across most action categories, but there is notable confusion in the ‘Hit a high ball’ category. This information provides valuable insights for further model improvement and data processing optimization.

Figure 6.

Confusion matrix of badminton action recognition.
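A row-normalized confusion matrix of the kind shown in Figure 6 can be computed as sketched below. The function names and the two-class toy labels are illustrative assumptions; with the six stroke categories, `num_classes` would be 6.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for truth, pred in zip(true_labels, predicted_labels):
        counts[truth][pred] += 1
    return counts


def row_normalize(counts):
    """Divide each row by its total so the diagonal holds per-class recognition rates."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy example with two classes.
cm = confusion_matrix([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1], num_classes=2)
rates = row_normalize(cm)
```

Each diagonal entry of the normalized matrix is the recognition rate of one category, and the off-diagonal entries in a row show which categories it is confused with.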

4.3 Qualitative experiments

In this work, we assess the quality of badminton actions in the test videos using a standardized dataset. Results of our qualitative experiments are presented in Figures 7–13, where each human target is identified and evaluated based on the consistency of their badminton movements. The more standard and consistent the movements, the higher the final quality score. However, when we applied our action quality assessment method to professional players in actual competition (Figure 12, left), we observed a lower quality score of only 78.74%. This indicates that strokes in real competitions can deviate to varying degrees from the standardized actions in our dataset. Our dataset captures players performing standardized actions in optimal physical condition, and our research has implications for both badminton training and quality analysis of competition actions.

Figure 7.

Positive sample serve and score.

Figure 8.

Positive sample smash and score.

Figure 9.

Positive sample lob and score.

Figure 10.

Positive sample forehand pick and score.

Figure 11.

Positive sample backhand pick and score.

Figure 12.

Negative sample smash and score.

Figure 13.

Negative sample lob and score.

5. Conclusions

This work constructs a preliminary dataset of standard action technique videos for badminton players and develops models for player pose estimation and tracking. The badminton actions of the target players are detected, recognized, and evaluated for quality. The models trained on the constructed dataset exhibit excellent generalization performance and accomplish the expected results with high precision and accuracy. Consequently, a quantitative and qualitative intelligent action assessment system for badminton players is established, which positively impacts research on badminton action and teaching quality assessment and provides guidance for training and enhancing the competitive skills of badminton players.

Footnotes

Acknowledgements

This work was supported in part by the Sports Technology Project of Shanghai Administration of Sports under Grant 21C001.

ORCID iD

Yichen Feng

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

