Badminton video action recognition based on time network

Abstract

With the continuing development of artificial intelligence, computer vision research has shifted from traditional methods based on “feature engineering” to deep learning methods based on “network engineering”, which use deep neural networks to extract and classify features automatically. Traditional methods based on hand-crafted features are computationally expensive, are usually applied to simple problems, and are ill-suited to feature extraction from large-scale data. Deep learning methods learn features directly from large-scale data, which greatly reduces the difficulty of designing features by hand, and they have been applied successfully to many visual recognition tasks. Video action recognition has followed the same shift, with research now oriented toward building more effective deep neural network models. A review of related work shows that research on temporal segment networks for football and basketball video action recognition is relatively rich, whereas research on badminton is lacking. A study of badminton video action recognition based on the temporal segment network can therefore enrich the existing results and provide a reference for follow-up research. This paper introduces a lightweight attention mechanism into the temporal segment network, forming an attention-based temporal segment network, and trains it to obtain a classifier of badminton stroke actions that predicts four common types: forehand stroke, backhand stroke, overhead stroke, and lift (pick). The experimental results show that the recall and precision for every stroke type exceed 86%, and the average recall and precision are 91.2% and 91.6%, respectively, indicating that the method based on the temporal segment network comes close to human judgment and can effectively recognize strokes in badminton video.

Keywords

Time segmentation network; video action recognition; action recognition

1. Introduction

With the development of video action recognition in computer vision, sports action recognition is increasingly used for motion statistics, kinematics research, and physical education teaching. Ball games can be divided by structure into two types, time-based and score-based, according to the type of game [1]. In time-based sports such as basketball, football and rugby, no player has a designated area during the game [2]; the players of both sides share the field and win through teamwork within a fixed time. Score-based sports include tennis, badminton, table tennis, etc. [3]. The two sides always play in their own areas and confront their opponents from their positions, so play of this type usually reflects the players' own level [4]. When watching such games, spectators tend to pay attention to the movement characteristics of the players.

In badminton competition, the movement and posture of athletes provide important clues for understanding the course of a game and discovering players' movement characteristics [5]. Badminton shares characteristics with volleyball, tennis and table tennis: they all satisfy the conditions of a Markov process, and each stroke in a rally is completed in an instant [6]. To help coaches and audiences grasp key information such as players' movements in badminton videos, it is meaningful to recognize the movements of badminton players automatically [7].

Computer vision has made major breakthroughs in video action recognition, but most work addresses generic recognition of everyday actions, and research on badminton video action recognition is lacking. If the strokes in a badminton video can be located in time and their types judged accurately, highlight clips of each stroke type can be provided to the audience [8]. In addition, in the field of sports video analysis, action classification for badminton can be transferred to tennis and other events [9].

The type of a badminton stroke can be distinguished by the flight trajectory and landing point of the shuttlecock, both of which are closely related to the player's posture when hitting it [10]. This study considers five common stroke categories: forehand, backhand, high clear, smash (kill) and lift (pick). On the basis of locating actions in badminton game video and segmenting the video automatically, deep learning is used to learn the features of the stroke actions and recognize them. This can assist coaches in analyzing the movements in a video and lets viewers enjoy highlight clips of each stroke type.

2. Current status of video action recognition at home and abroad

2.1 Current status of video action

At present, video action recognition is mainly divided into trimmed video action recognition and long (untrimmed) video action recognition. A trimmed video contains exactly one complete action, while a long video contains multiple consecutive actions. In long-video action recognition, temporal action segmentation is a very important step, and these videos have clear switching boundaries between adjacent actions [11]. Adjacent badminton strokes, however, have no obvious boundary, so methods designed for long-video action recognition are not suitable for locating badminton strokes [12]. A badminton action meta-video is a trimmed video. To extract meta-videos automatically, each single stroke must be located in the badminton video; the video segment of a single stroke constitutes a meta-video. When the stroke sequence is short, the human motion segmentation algorithm TS-WMS cannot fully learn it and tends to produce redundant segmentation points [13]. Ji et al. [14] proposed capturing a player's stroke from the inflection points of the shuttlecock's position and flight direction, but this method is limited to side views of the court in which the players are on the left and right of the frame, and it does not apply to the broadcast view used in most badminton game videos. For temporal action localization in untrimmed videos, Li et al. [15] proposed a multi-scale sliding-window integration method built on the original two-stream convolutional neural network, which locates actions well in the UCF101 and HMDB51 data sets and reduces the influence of background noise. However, the located action instance occupies only a small part of the whole video, the remaining parts act as interference for the behavior recognition model, and the method cannot separately locate multiple consecutive actions in a video [16]. In conclusion, existing methods for video action localization work for videos with clear action boundaries or for long videos, but they have difficulty locating actions in badminton video.

2.2 Current status of badminton movement recognition

Zhang and Wu [17] used a posture-based method that extracts the histogram of oriented gradients (HOG) from the player's bounding box and classifies strokes with a support vector machine (SVM). The method in [18] is also based on HOG, but its training and test data are single images of the stroke, in which different strokes show very similar postures, such as the smash and the high clear, or the drive and the drop. Li et al. [19] classified the shots of compressed badminton videos and identified the stroke by detecting the trajectory of the shuttlecock. Climent-Pérez and Florez-Revuelta [20] proposed a badminton video action recognition method based on HOG features of dense trajectories and trajectory alignment, which divides player strokes into forehand, backhand, smash and other types; however, HOG is not scale invariant and, because it is built on gradients, is quite sensitive to noise. Observing that sports video often has poor pixel quality, a non-static background and low image resolution, the authors of [21] proposed a motion descriptor, captured images of the player's swing by detecting key audio elements, and finally used a support vector machine to classify three typical swings: swing, left swing and right swing. Zhao et al. [22] proposed a two-layer hidden Markov model (HMM) classification algorithm based on a body sensor network to identify the badminton stroke type, but stroke data captured by sensors is not suited to identifying strokes in video. You et al. [23] compared four pre-trained deep convolutional models, AlexNet, GoogLeNet, VggNet-16 and VggNet-19, on classifying badminton images to identify players' movements. GoogLeNet achieved the highest classification accuracy, but the study still targeted static images of the hitting moment and did not classify badminton action meta-videos. Based on the above, it is feasible to study action recognition for badminton meta-videos using GoogLeNet combined with the TSN network.

3. Badminton video action recognition based on time network

This chapter studies a method for identifying badminton movements in meta-videos. A meta-video dataset is first constructed and preprocessed. Then, based on the temporal segment network, four action categories are recognized: forehand, backhand, overhead and lift.

3.1 Meta-video dataset construction

The input sample of the temporal segment network is a meta-video (a video clip containing only one action). To obtain a high-quality training set, keep each stroke in a meta-video as complete as possible, and ensure the quality of the network's training data, the meta-video dataset must be built manually. Starting from badminton highlight videos, this section uses a fast video merging and splitting tool to cut out 5,160 badminton meta-videos as the dataset.

With the assistance of badminton school team members, the samples in the meta-video dataset are labeled with five actions: forehand, backhand, high clear, smash and lift. The sampled frame sequence of each action meta-video is shown in Fig. 1.

Figure 1.

Frame sequence sampling of badminton action meta-videos.

3.2 Meta-video two-stream extraction

The TSN (Temporal Segment Networks) inherits the two-stream architecture of the Two-Stream CNN, whose input consists of two parts: the frames and the optical flow of a meta-video. The structure is composed of a spatial stream convolutional network and a temporal stream convolutional network, which process frame image features and optical flow features, respectively. Before training the TSN, each meta-video in the training set is decomposed into RGB frames; randomly sampled frames are fed into the spatial network, and the optical flow extracted from these frames and their adjacent frames is fed into the temporal network.

A badminton match video, like any video, consists of two parts: a spatial stream and a temporal stream. The spatial stream is the appearance information of each frame, including the scene and the shape, size and spatial position of objects. The temporal stream is the optical flow between frames, which reflects the relationship between a moving object and its surroundings and the direction and speed of the object's movement. Optical flow describes the motion between frames: the pixel correspondence between the current frame and the previous frame is obtained from how pixels move over time and from the correlation between adjacent frames, and the optical flow between adjacent frames is computed from it.

Mathematically, the optical flow is a set of displacement vectors $d_{T}$ between frame $T$ and frame $T+1$. $d_{T}({x,y})$ is the displacement vector at position $({x,y})$ of frame $T$, which maps the pixel at $({x,y})$ to its corresponding coordinate in frame $T+1$. The horizontal and vertical components of the flow are $d_{T}^{x}$ and $d_{T}^{y}$, respectively; they give the pixel movement in each direction and can be regarded as two channels of the vector field. For a sequence of $N$ consecutive frames, the stacked optical flow at position $(x,y)$ of frame $T$ can be represented by Eq. (1):

$\displaystyle I_{T}[{({x,y,2k-1}),({x,y,2k})}]=[{d_{T+k-1}^{x}({P_{k}}),d_{T+k-1}^{y}({P_{k}})}]$ (1)

where $x\in[{1,w}]$, $y\in[{1,H}]$, $k\in[{1,N}]$, and $P_{k}$ is the $k$-th point on the optical flow trajectory. Let the initial point be the pixel at coordinate $({x,y})$ in frame $T$; the trajectory points are then defined by the recurrence in Eq. (2):

$\displaystyle P_{1}=({x,y}),\qquad P_{k}=P_{k-1}+d_{T+k-2}({P_{k-1}}),\quad k>1$ (2)
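
The trajectory stacking of Eqs. (1) and (2) can be sketched in a few lines of plain Python. This is an illustration only: the function name `stack_trajectory_flow`, the 0-based coordinates, the nearest-pixel lookup and the clamping at the image border are all assumptions that the text does not specify.

```python
def stack_trajectory_flow(flows, x, y):
    """Return the 2N stacked flow values for pixel (x, y), following Eqs. (1)-(2).

    flows[j] is the displacement field d_{T+j}; flows[j][row][col] = (dx, dy).
    Coordinates are 0-based here, unlike the 1-based ranges used in the text.
    """
    height, width = len(flows[0]), len(flows[0][0])

    def sample(field, px, py):
        # Assumption: nearest-pixel lookup, clamped to the image border.
        col = min(max(int(round(px)), 0), width - 1)
        row = min(max(int(round(py)), 0), height - 1)
        return field[row][col]

    px, py = float(x), float(y)              # P_1 = (x, y)
    channels = []
    for j, field in enumerate(flows):
        if j > 0:
            # Eq. (2): step along the previous flow field to reach the next track point.
            dx, dy = sample(flows[j - 1], px, py)
            px, py = px + dx, py + dy
        dx, dy = sample(field, px, py)       # flow of this frame at the track point
        channels.extend([dx, dy])            # channels 2k-1 and 2k of Eq. (1)
    return channels
```

Sampling each later flow field at the tracked point, rather than at the fixed location $(x,y)$, is what distinguishes trajectory stacking from plain stacking of flow fields.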

For the badminton game video, an example of the optical flow field extracted between the adjacent frames $f_{i-1}^{\textit{RGB}}$ and $f_{i}^{\textit{RGB}}$ is shown in Fig. 2, where $f_{i}^{\textit{FLOW\_X}}$ and $f_{i}^{\textit{FLOW\_Y}}$ are the optical flow components in the horizontal and vertical directions, respectively.

Figure 2.

Schematic diagram of the optical flow extraction.

3.3 A temporal segmentation network introducing an attention mechanism

To improve model performance, the SE (Squeeze-and-Excitation Networks) attention mechanism has been introduced into the Inception module with good results [24]. SE makes a neural network focus on feature channels that are meaningful for the classification task and suppress channels that contribute little. However, it pays no attention to spatial location information. Drawing on this idea, this paper introduces the CBAM (Convolutional Block Attention Module) attention mechanism after the BN-Inception structure, improving the original TSN network into the CBAM-TSN network.

The CBAM structure includes two modules: the channel attention module (CAM) and the spatial attention module (SAM). CAM focuses on which feature channels play a major role in the task.

The input feature $F$ of size $H\times W\times C$ undergoes spatial max pooling (Max Pooling) and average pooling (Avg Pooling), giving two channel descriptors of size $1\times 1\times C$. These are passed through a shared two-layer convolutional network with ReLU activation, and a Sigmoid activation then yields the CAM weight coefficient $M_{C}$. Multiplying $M_{C}$ by $F$ gives the new feature $F'$, which is passed to SAM.

SAM focuses on which spatial regions are more effective for the task. Similarly to CAM, average pooling and max pooling are applied to $F'$ along the channel dimension to obtain two descriptors of size $H\times W\times 1$, which are concatenated along the channel dimension. A 7 $\times$ 7 convolution followed by a Sigmoid activation then yields the SAM weight coefficient $M_{S}$, and multiplying $M_{S}$ by $F'$ gives the final CBAM output feature $F''$. CAM and SAM can be combined in series or in parallel. Studies show that combining the two in series with CAM first achieves better results, so CAM and SAM are introduced into TSN in that serial order.
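
The two attention steps can be written out as a framework-free sketch, shown here in plain Python on nested lists so that every operation is visible. It is not the paper's implementation: the function names, the channel-first `C x H x W` layout and the weight shapes (`w1`, `w2`, `kernel`) are illustrative assumptions. Because the channel descriptors have size $1\times 1\times C$, the shared two-layer network is written as two weight matrices, and the two branch outputs are summed before the Sigmoid, as in the original CBAM formulation.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def channel_attention(feat, w1, w2):
    """CAM. feat is C x H x W (nested lists); w1 is R x C and w2 is C x R."""
    def shared_net(desc):
        hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    flat = [[v for row in channel for v in row] for channel in feat]
    max_desc = [max(values) for values in flat]                  # 1 x 1 x C
    avg_desc = [sum(values) / len(values) for values in flat]    # 1 x 1 x C
    # Both descriptors pass through the same two layers; the outputs are summed.
    m_c = [sigmoid(a + b) for a, b in zip(shared_net(max_desc), shared_net(avg_desc))]
    return [[[v * m for v in row] for row in channel] for channel, m in zip(feat, m_c)]


def spatial_attention(feat, kernel, size=7):
    """SAM. kernel is 2 x size x size and acts on the [average, max] maps."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    avg_map = [[sum(feat[n][i][j] for n in range(c)) / c for j in range(w)] for i in range(h)]
    max_map = [[max(feat[n][i][j] for n in range(c)) for j in range(w)] for i in range(h)]
    pad = size // 2
    weights = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for m, plane in enumerate((avg_map, max_map)):
                for di in range(size):
                    for dj in range(size):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:   # zero padding outside the map
                            acc += kernel[m][di][dj] * plane[ii][jj]
            weights[i][j] = sigmoid(acc)
    return [[[feat[n][i][j] * weights[i][j] for j in range(w)] for i in range(h)]
            for n in range(c)]


def cbam(feat, w1, w2, kernel):
    """Serial combination with CAM first, as described in the text."""
    return spatial_attention(channel_attention(feat, w1, w2), kernel)
```

The output has the same shape as the input, which is why the module can be placed after the backbone without changing the layers that follow it.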

The CBAM-TSN network structure is shown in Fig. 3. To represent the information of the whole video effectively, the input meta-video is first divided into $K$ segments of equal length, $\{{S_{1},S_{2},\cdots,S_{K}}\}$ ($K=3$ in Fig. 3), and a snippet $T_{k}$ is then randomly sampled from each segment $S_{k}$ ($k=1,2,\cdots,K$). For each snippet $T_{k}$, the spatial convolutional network captures the static information of its frames and the temporal convolutional network captures its temporal information, and each network produces a prediction for that snippet. The per-category predictions of the different snippets are output by the convolutional layers. For snippet $T_{k}$, the prediction vector $F^{k}$ over the action categories is given by Eq. (3), where $S_{i,k}$ ($i\in[{1,C}]$, $k\in[{1,K}]$, with $C$ the total number of stroke categories) is the prediction of snippet $T_{k}$ for category $i$.

$\displaystyle F^{k}=F({T_{k};W})=\left[\begin{array}{c}S_{1,k}\\ S_{2,k}\\ \vdots\\ S_{C,k}\end{array}\right]$ (3)
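
The sparse sampling step, dividing a meta-video into $K$ equal segments and drawing one random position from each, might look as follows. The function name `sample_snippets` and the decision to ignore leftover frames at the end of the video are assumptions made for this sketch.

```python
import random


def sample_snippets(num_frames, num_segments=3, rng=random):
    """Split frame indices 0..num_frames-1 into K equal segments and
    draw one random frame index from each segment."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames // num_segments   # leftover frames at the end are ignored here
    return [k * seg_len + rng.randrange(seg_len) for k in range(num_segments)]
```

Because one index is drawn per segment, the sampled snippets always span the whole meta-video, however long it is, while the number of frames passed to the network stays fixed at $K$.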

The predictions of the snippets are then aggregated by the segmental aggregation function, and the weighted combination of the spatial and temporal aggregation results is taken as the final prediction. The whole network builds its loss function from the predictions and the true labels, iteratively updates the network parameters, and is trained end-to-end.

The dimensions of the input and output features at each level of the TSN network are shown in Table 1, where the backbone is BN-Inception. BN-Inception adds a batch normalization (BN) layer to the original Inception-v1 module, which effectively prevents unstable network performance caused by large variations in the data. The structure configuration of the initial stage $St_{0}$ is shown in Table 2.

Table 1

Feature dimensions at each level of the CNN

Network level		Input feature dimension	Output feature dimension
BN-Inception	$St_{0}$	224 $\times$ 224 $\times$ 3	28 $\times$ 28 $\times$ 192
	inception_3a	28 $\times$ 28 $\times$ 192	28 $\times$ 28 $\times$ 256
	inception_3b	28 $\times$ 28 $\times$ 256	28 $\times$ 28 $\times$ 320
	inception_3c	28 $\times$ 28 $\times$ 320	14 $\times$ 14 $\times$ 576
	inception_4a	14 $\times$ 14 $\times$ 576	14 $\times$ 14 $\times$ 576
	inception_4b	14 $\times$ 14 $\times$ 576	14 $\times$ 14 $\times$ 576
	inception_4c	14 $\times$ 14 $\times$ 576	14 $\times$ 14 $\times$ 608
	inception_4d	14 $\times$ 14 $\times$ 608	14 $\times$ 14 $\times$ 608
	inception_4e	14 $\times$ 14 $\times$ 608	7 $\times$ 7 $\times$ 1056
	inception_5a	7 $\times$ 7 $\times$ 1056	7 $\times$ 7 $\times$ 1024
	inception_5b	7 $\times$ 7 $\times$ 1024	7 $\times$ 7 $\times$ 1024
CBAM		7 $\times$ 7 $\times$ 1024	7 $\times$ 7 $\times$ 1024
Avg Pool		7 $\times$ 7 $\times$ 1024	1 $\times$ 1 $\times$ 1024
Dropout		1 $\times$ 1 $\times$ 1024	1 $\times$ 1 $\times$ 1024
Linear		1024	4

Figure 3.

The CBAM-TSN network structure.

As can be seen from Table 1 above, the network outputs four categories. The high clear and the smash are highly consistent in hitting posture, and both are overhead strokes. So that meta-videos of different categories show different player postures in the training set, the high clear and the smash are temporarily merged into a single “overhead shot” category in the training set.

Table 2

Structure configuration of the initial stage $St_{0}$

Network level	Kernel size	Padding	Stride	Input channels	Output channels
CNN	7 $\times$ 7	(3,3)	(2,2)	3	64
BN	–	–	–	64	64
ReLU	–	–	–	64	64
MaxPool	3 $\times$ 3	0	(2,2)	64	64
CNN	1 $\times$ 1	0	(1,1)	64	64
BN	–	–	–	64	64
ReLU	–	–	–	64	64
CNN	3 $\times$ 3	(1,1)	(1,1)	64	192
BN	–	–	–	192	192
ReLU	–	–	–	192	192
MaxPool	3 $\times$ 3	0	(2,2)	192	192
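
As a quick check on how the layers of Table 2 take a 224 $\times$ 224 input down to the 28 $\times$ 28 output listed for $St_{0}$ in Table 1, the spatial size can be traced layer by layer with the usual formula (size + 2 x padding - kernel) / stride + 1. The sketch below assumes that the pooling layers round up (ceil mode) while the convolutions round down; the tables do not state the rounding rule, and the names `out_size`, `ST0_LAYERS` and `trace_st0` are illustrative.

```python
import math


def out_size(size, kernel, stride, padding, ceil_mode=False):
    """Spatial output size of a convolution or pooling layer."""
    span = (size + 2 * padding - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1


# (name, kernel, stride, padding) for the Table 2 layers that have a kernel.
ST0_LAYERS = [
    ("conv 7x7", 7, 2, 3),
    ("maxpool 3x3", 3, 2, 0),
    ("conv 1x1", 1, 1, 0),
    ("conv 3x3", 3, 1, 1),
    ("maxpool 3x3", 3, 2, 0),
]


def trace_st0(size=224, pool_ceil=True):
    """Return the spatial size after each layer, starting with the input size."""
    sizes = [size]
    for name, kernel, stride, padding in ST0_LAYERS:
        # Assumption: only the pooling layers use ceil rounding.
        ceil_mode = pool_ceil and name.startswith("maxpool")
        size = out_size(size, kernel, stride, padding, ceil_mode)
        sizes.append(size)
    return sizes
```

Under the ceil-mode assumption the trace is 224, 112, 56, 56, 56, 28, which agrees with Table 1. With floor rounding in the pooling layers the trace would end at 27, so the rounding rule matters when reproducing the table.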

Placing the CBAM attention mechanism after BN-Inception allows the pre-trained model to be reused for transfer learning. The CBAM output features pass successively through the average pooling layer (Avg Pool), the Dropout layer and the fully connected layer (Linear), which transforms the final output dimension to 4, corresponding to the predictions for the four actions: forehand, backhand, overhead and lift.

The average pooling layer removes redundant information, compresses the features, and reduces computation and memory consumption. The Dropout layer weakens co-adaptation among network nodes and makes the network model more robust. The batch normalization (BN) layer is defined in Eq. (4), where $\varepsilon$ prevents a zero denominator and is set to 1e-5 here, which helps keep the output features stable.

$E({X_{i}})$ and $\textit{Var}({X_{i}})$ denote the expectation and variance of the feature vectors, respectively, and ${\gamma}$ and ${\beta}$ denote the weight and offset parameters to be learned. The ReLU activation applied after the BN layer, Eq. (5), sets the output of some neurons to zero, which increases the sparsity of the network and reduces the risk of overfitting. The “ $\bigoplus$ ” in the figure denotes the aggregation of the features from each propagation path in $St_{i}$; the aggregated feature dimension is the sum of the output feature dimensions of the paths.

$\displaystyle BN_{\gamma,\beta}({X_{i}})=\frac{X_{i}-E({X_{i}})}{\sqrt{\textit{Var}({X_{i}})+\varepsilon}}\cdot\gamma+\beta$ (4)

$\displaystyle\textit{ReLU}(X)=\max({0,X})$ (5)
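
A small numerical sketch of Eqs. (4) and (5) on a single feature is given below. It uses the biased variance of the batch and adds $\varepsilon$ to the variance before taking the square root, as common deep learning frameworks do; the function names and the default values of `gamma` and `beta` are illustrative assumptions.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in values]


def relu(values):
    """Set negative activations to zero."""
    return [max(0.0, v) for v in values]
```

After normalization roughly half of the values of a symmetric batch are negative, so the ReLU that follows zeroes them out, which is the sparsity effect described above.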

For the aggregation layer in a TSN, the choice of aggregation function $g$ determines whether the loss function is differentiable. The category scores of the segments are combined to obtain the category prediction $G$ for the entire meta-video, as in Eq. (6).

$\displaystyle G=g({F({T_{1};W}),F({T_{2};W}),\cdots,F({T_{K};W})})$ (6)

Based on this aggregation, the prediction function $H$ outputs the probability that the entire meta-video belongs to each action. TSN models the sampled video snippets as in Eq. (7):

$\displaystyle\textit{TSN}({T_{1},T_{2},\cdots,T_{K}})=H({g({F({T_{1};W}),F({T_{2};W}),\cdots,F({T_{K};W})})})$ (7)

TSN training uses back-propagation over the combined predictions of the segments to optimize the model parameters $W$. The gradient of the loss function $L$ with respect to $W$ is given by Eq. (8):

$\displaystyle\frac{\partial L({y,G})}{\partial W}=\frac{\partial L}{\partial G}\sum_{k=1}^{K}\frac{\partial G}{\partial F({T_{k}})}\frac{\partial F({T_{k}})}{\partial W}+\frac{\lambda}{n}W$ (8)

where $K$ is the number of segments of the TSN. As the equation shows, the parameter update uses the aggregated prediction $G$ derived from the predictions of all segments. In this sense, the TSN learns its parameters from the entire input video rather than from each video segment separately.
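
To make the chain rule in Eq. (8) concrete, the sketch below evaluates the first two factors for one common configuration: average pooling as $g$, Softmax as the prediction function $H$, and cross-entropy as the loss $L$. The text does not name $H$ or $L$ explicitly, so both are assumptions, as are the function names; the regularization term $\frac{\lambda}{n}W$ and the network-specific factor $\partial F({T_{k}})/\partial W$ are left out.

```python
import math


def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tsn_loss(segment_scores, label):
    """segment_scores holds K lists of C class scores; label is the true class index."""
    k = len(segment_scores)
    consensus = [sum(col) / k for col in zip(*segment_scores)]   # G, average pooling
    probs = softmax(consensus)                                    # H(G)
    return -math.log(probs[label]), probs


def grad_wrt_segments(segment_scores, label):
    """dL/dF^k = dL/dG * dG/dF^k = (H(G) - onehot(label)) / K for every segment k."""
    k = len(segment_scores)
    _, probs = tsn_loss(segment_scores, label)
    row = [(p - (1.0 if i == label else 0.0)) / k for i, p in enumerate(probs)]
    return [list(row) for _ in range(k)]
```

Every segment receives the same gradient, computed from the video-level prediction, which illustrates the statement that the parameters are learned from the entire video rather than from individual segments.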

3.4 Aggregate function analysis

The choice of aggregation function $g$ determines whether the loss function of the TSN is differentiable, and hence whether the TSN can be trained by back-propagation, and it also affects the test performance of the network. This section describes three commonly used differentiable aggregation functions, average pooling, max pooling and linear weighting, derives their gradients with respect to the outputs of the video segments, and analyzes their characteristics and applications.

Average pooling takes, for each category, the average of the predictions of all segments as the aggregation result for the entire input video, as in Eq. (9):

$\displaystyle g_{i}=\frac{1}{K}\sum_{k=1}^{K}f_{i}^{k},\quad i\in\{{1,2,\cdots,C}\}$ (9)

where $i$ is the category index, $C$ is the total number of categories, and $f_{i}^{k}$ is the $i$-th element of $F^{k}$. The gradient of $g_{i}$ with respect to $f_{i}^{k}$ is given by Eq. (10):

$\displaystyle\frac{\partial{g}_{i}}{\partial{f}_{i}^{k}}=\frac{1}{{K}}$ (10)

The basic purpose of average pooling aggregation is to use the information of all segments in the video, taking the mean prediction of all video segments as the final result. This aggregation function models all video segments and so captures information about the entire video. However, in slightly longer videos some segments are unrelated to the action to be recognized, and averaging in their predictions can reduce the accuracy of the final result.

Max pooling takes, for each category, the maximum prediction over all segments as the aggregation result for the entire input video. The function and the gradient of $g_{i}$ with respect to $f_{i}^{k}$ are given by Eqs (11) and (12):

$\displaystyle g_{i}=\max_{k\in\{{1,2,\cdots,K}\}}f_{i}^{k},\quad i\in\{{1,2,\cdots,C}\}$ (11)

$\displaystyle\frac{\partial g_{i}}{\partial f_{i}^{k}}=\left\{\begin{array}{ll}1,&\text{if }k=\arg\max_{l}f_{i}^{l}\\ 0,&\text{otherwise}\end{array}\right.$ (12)

The basic purpose of max pooling aggregation is to select, for each action category, the single most discriminative video segment, the one with the largest activation for that category. For a given category it relies only on the segment most relevant to that category and ignores the others. Max pooling thus drives the TSN to learn from the segments with the largest activation for each category, but it cannot exploit all video segments.

Linear weighted aggregation introduces a model parameter $w$, where $w_{k}$ is the weight of the aggregation function for the $k$-th segment, so that $g_{i}=\sum_{k=1}^{K}w_{k}f_{i}^{k}$. The gradients of $g_{i}$ with respect to $f_{i}^{k}$ and $w_{k}$ are:

$\displaystyle\frac{\partial g_{i}}{\partial f_{i}^{k}}=w_{k}$ (13)

$\displaystyle\frac{\partial g_{i}}{\partial w_{k}}=f_{i}^{k}$ (14)

This aggregation can be interpreted as follows: an action is composed of multiple parts, each part plays a different role in recognizing the action, and the network ultimately learns the importance of each video segment for action recognition.
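
The three aggregation functions of this section can be compared side by side in a short sketch. Here `scores` is a list of $K$ per-segment score lists of length $C$, linear weighting is taken to be the weighted sum of the segment scores with weights $w_{k}$, and ties in the maximum are resolved in favor of the first segment; the function names and the tie rule are assumptions made for illustration.

```python
def average_pooling(scores):
    """Mean of the K segment scores for each class."""
    k = len(scores)
    return [sum(col) / k for col in zip(*scores)]


def max_pooling(scores):
    """Largest of the K segment scores for each class."""
    return [max(col) for col in zip(*scores)]


def linear_weighting(scores, weights):
    """Weighted sum of the segment scores, one weight w_k per segment."""
    return [sum(w * s for w, s in zip(weights, col)) for col in zip(*scores)]


def max_pooling_grad(scores):
    """Gradient of max pooling: 1 for the arg-max segment of each class, else 0."""
    grads = [[0.0] * len(scores[0]) for _ in scores]
    for i, col in enumerate(zip(*scores)):
        grads[col.index(max(col))][i] = 1.0   # first maximum wins on ties
    return grads
```

The gradient of average pooling is the constant 1/K for every segment and that of linear weighting is $w_{k}$, so only max pooling routes the whole gradient to a single segment per class.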

4. Case application of badminton video action recognition based on time network

4.1 CBAM-TSN network training

The hardware platform for the experiments is an Intel Xeon Platinum 8160Ts CPU, TITANRTX 2080 GPU 2, and 1T of memory. CUDA10 and CUDNN7.6 serve as the GPU acceleration libraries, and the experiments are run on the Ubuntu18.0 operating system with the deep learning framework PyTorch1.1. The CBAM-TSN is evaluated with the hold-out method using stratified sampling: the data is divided into disjoint categories, a fixed proportion of samples is drawn at random from each category, and the drawn samples together form the sample set. Here, 10% of the samples in each category of the meta-video dataset are assigned to the test set and the remaining samples to the training set.
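
A stratified hold-out split of this kind can be sketched with the standard library alone. The function name `stratified_holdout`, the fixed random seed and the rounding of the per-class test count are assumptions of the sketch, not details given in the text.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices), drawing test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)   # per-class test count
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Sampling within each class keeps the class proportions of the test set equal to those of the full dataset, which matters when the stroke categories are not equally frequent.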

The pre-trained model is bninception_caffe6. During training, stratified ten-fold cross-validation is adopted: before each validation round, the meta-videos in the training set are divided into ten folds, one fold is used as the validation set, and the remaining 90 percent take part in training the network.

The learning rate is reduced to 10% of its value at epoch 30 and again at epoch 40. An important parameter of the sparse sampling strategy followed by the CBAM-TSN network is the number of segments $K$. When $K$ equals 1, the TSN degenerates into an ordinary two-stream model (Two-Stream); increasing $K$ improves model performance to some extent. $K$ is set to values from 1 to 5, the network loss on the validation set is tracked for each value, and the validation loss is output once every 20 training iterations. The value to which the training loss converges decreases significantly when $K$ rises from 1 to 2, whereas the loss curves almost overlap when $K$ rises from 2 to 5 and performance is not significantly improved. After about 35 rounds, the training loss stops decreasing and tends to converge.
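
The step schedule for the learning rate can be expressed as a small function. The initial learning rate is not given in the text, so `base_lr=0.001` is a placeholder, and the sketch assumes that each drop takes effect from the milestone epoch onward.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(30, 40), factor=0.1):
    """Multiply the base rate by `factor` once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

With these placeholder values the rate is 0.001 before epoch 30, one tenth of that from epoch 30, and one hundredth from epoch 40.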

4.2 Experimental result

For the backbone convolutional structure of the TSN network, this paper compares four neural network structures, AlexNet, VGG16, BN-Inception and ResNet50, each combined with its corresponding pre-trained model.

The average recall, average precision and AUC of the four network models are then compared, and the comparison results are shown in Table 3. VGG16 achieves the highest average recall on the test set and ResNet50 the highest macro-average AUC, while BN-Inception achieves the highest average precision and the highest micro-average AUC. After this comprehensive comparison, BN-Inception is adopted as the backbone convolutional structure of the TSN in this paper. On the test set, the original TSN, the TSN with the SE attention mechanism, and the TSN with the CBAM attention mechanism are then compared. Their average recall, average precision, and micro-average and macro-average AUC are shown in Table 3.

Table 3
Ablation experiment evaluation

Model	$R$ (%)	$P$ (%)	Micro-AUC	Macro-AUC
TSN	90.2	92.5	0.9879	0.9868
SE-TSN	92.3	93.7	0.9887	0.9894
CBAM-TSN	93.5	94.3	0.9908	0.9903

The test accuracy of the three aggregation functions introduced above, max pooling, average pooling and linear weighting, is compared in Table 4. On the test set, average pooling achieves the best accuracy, so it is adopted as the aggregation function of the network model.

Table 4

Effect of different aggregation functions on the recognition accuracy

Aggregate function	Precision (%)
Average-pooling	94.3
Max pooling	91.6
Linear weighting	92.4

The change in test performance with $K$ is shown in Fig. 4: classification accuracy is positively correlated with $K$, while classification speed (the number of meta-videos recognized per minute) is negatively correlated with it. As the figure shows, recognition accuracy improves significantly when $K$ rises from 1 to 2 but not significantly from 2 to 5. Because a meta-video is a short trimmed video, the larger $K$ is, the smaller the difference in segment length between different values of $K$, and the smaller the difference in the information sampled from the segments. At $K=3$, recognition speed and accuracy reach a relative balance, so $K$ defaults to 3.

Figure 4.

Test performance as a function of the parameter $K$.

To evaluate the performance of stroke classification, the ROC curves of the CBAM-TSN network model for the four categories, forehand, backhand, overhead and lift, are shown in Fig. 5 below.

The AUC values of the four action categories, forehand, backhand, lift and overhead, are all above 0.98, and the micro-average and macro-average AUC are both about 0.99, indicating that the TSN network with CBAM, trained through transfer learning, performs well.

Table 5

Confusion matrix of badminton action recognition

Observed category	Predicted category
	Forehand	Backhand	High clear	Smash	Lift
Forehand	119	1	1	0	9
Backhand	1	76	2	1	3
High clear	1	0	70	10	0
Smash	2	0	4	75	0
Lift	6	1	0	1	127

Table 6

Recall and precision statistics of badminton action recognition

Stroke category	$R_{i}$ (%)	$P_{i}$ (%)
Forehand	91.5	92.2
Backhand	91.6	97.4
High clear	86.4	90.9
Smash	92.6	86.2
Lift	94.1	91.4
Average	91.2	91.6

Figure 5.

Badminton action recognition ROC-AUC metric.

To further evaluate how well the model recognizes each category, the overhead meta-video samples in the test set are further divided into two categories, high clear and smash. The confusion matrix of the predictions on the test set is shown in Table 5. The diagonal entries are the numbers of correctly predicted meta-video samples in each category; they total 467, or 91.6% of the 510 test samples, which indicates good recognition accuracy. Confusion is relatively frequent between forehand and lift (pick) and between high clear and smash. A forehand stroke and a lift sometimes show similar postures in the video. When the target at the end of the video is too small or too blurred, the morphological method of this paper tends to misjudge a high clear as a smash, and when several sources of strong dynamic noise appear in the background at the end of the video, a smash is easily misjudged as a high clear.

Based on the confusion matrix and the performance metrics, the recall and precision of the model for each stroke can be calculated, as shown in Table 6.

As can be seen from Table 6, the recall and precision for every stroke category exceed 86%, and the average recall and precision are 91.2% and 91.6% respectively, indicating that the method based on the temporal segment network comes close to the level of human judgment to a large extent and can effectively recognize strokes in badminton videos.
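The values in Table 6 can be reproduced directly from the confusion matrix in Table 5. The sketch below assumes the standard definitions: recall $R_{i}$ is the diagonal entry divided by its row sum (all samples observed in category $i$), and precision $P_{i}$ is the diagonal entry divided by its column sum (all samples predicted as category $i$). The variable and function names are illustrative only.

```python
labels = ["Forehand", "Backhand", "High clear", "Smash", "Lift"]

# Rows: observed category; columns: predicted category (values from Table 5)
confusion = [
    [119, 1, 1, 0, 9],
    [1, 76, 2, 1, 3],
    [1, 0, 70, 10, 0],
    [2, 0, 4, 75, 0],
    [6, 1, 0, 1, 127],
]


def recall_precision(cm):
    """Per-class recall (row-normalized) and precision (column-normalized), in %."""
    n = len(cm)
    recall = [100.0 * cm[i][i] / sum(cm[i]) for i in range(n)]
    precision = [100.0 * cm[i][i] / sum(row[i] for row in cm) for i in range(n)]
    return recall, precision


recall, precision = recall_precision(confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

for name, r, p in zip(labels, recall, precision):
    print(f"{name:<11} R = {r:5.1f}  P = {p:5.1f}")
print(f"Average     R = {sum(recall) / 5:5.1f}  P = {sum(precision) / 5:5.1f}")
print(f"Overall accuracy: {correct}/{total} = {100.0 * correct / total:.1f}%")
```

For example, the high clear row sums to 81 samples, of which 70 are predicted correctly, giving a recall of 86.4%; the smash column sums to 87 predictions, of which 75 are correct, giving a precision of 86.2%.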

5. Conclusion

Combining theoretical research with experimental analysis, this paper studies badminton video clip extraction, the temporal localization of badminton actions, and stroke recognition. The main work and conclusions are as follows. For badminton video clips, key movement trajectories are examined by comparing their amplitude changes. Based on the temporal segment network combined with a lightweight attention mechanism, common strokes in badminton match videos are classified and recognized, and a method based on image morphology processing is proposed to distinguish high clear meta-videos from smash meta-videos. The experimental results show that the IoU of temporal localization in badminton videos is 82.6%, the AUC of each action category is higher than 0.98, and the average recall and precision are 91.2% and 91.6% respectively, indicating that this method can effectively recognize strokes in badminton videos.

Acknowledgments

2022 Hebei Province School Physical Education, Health and Art National Defense Education Special Task Research Project “Study on the evaluation index system of middle school students’ health literacy based on Analytic Hierarchy Process”. 2021 Hebei Higher Education Teaching Reform Research Project, No.: 2021GJJG558.

