Action recognition technology for professional basketball game training based on SSD and 3D convolution

Abstract

In professional basketball games, athlete action recognition is an important part of sports performance analysis and intelligent assisted training. The complex scene background, frequent target occlusion, and varied motion patterns pose challenges to traditional detection and recognition methods in terms of accuracy and real-time performance. In view of this, the study proposes an improved object detection model for single shot multi-box detectors by combining pyramid feature integration module and dual axis convolutional perception module. At the same time, an action recognition model based on an improved 3D convolutional network is designed through adaptive weight fusion and attention mechanism. In the testing of the object detection model, when the number of iterations reached 500, the average accuracy improved by 6.14% and the frame rate decreased by 8.56%. The missed detection rate under low light conditions was 7.3%, the false detection rate was 8.7%, and the detection time was 30.8 ms. The highest detection accuracy of the action recognition model in complex backgrounds was 89.3%, and the robustness score was 91.9. The results indicate that the proposed model can maintain high accuracy and efficiency in complex backgrounds and fast movements in professional basketball game scenarios. The research model significantly improves the performance and robustness of action recognition, which can provide certain technical support for intelligent sports training systems.

Keywords

SSD 3D convolution basketball match action recognition single shot multi-box detector

Introduction

The rapid development of the global intelligent sports industry, accompanied by the increasing demand for high-level competitive events and training, has promoted in-depth research on precise motion analysis and intelligent assisted training technology.^1,2 As an important component of intelligent sports systems, action recognition plays a crucial role in improving performance analysis and training efficiency.³ However, action recognition in professional basketball games still faces many challenges. Especially in high-intensity competitions and complex background scenes, the rapid movement changes and frequent occlusions of athletes make it difficult for traditional detection algorithms to effectively capture target features, thereby affecting recognition accuracy and real-time performance.⁴ As a classic One-stage object detection algorithm, single shot multi-box detector (SSD) is widely used in real-time object detection tasks due to its efficient detection speed and strong small object localization ability.⁵

Literature review

Wang et al. proposed an improved model based on feature pyramid and SSD. By integrating multi-scale feature maps and introducing attention mechanisms, the detection ability of small targets was enhanced. The experimental results showed that on the PASCAL VOC2007 test set, the average accuracy of the algorithm reached 78.3%, which was 1.1% higher than the original algorithm.⁶ Sogabe et al. identified the position and type of instruments in ophthalmic surgical microscope images through the SSD model to detect the main surgical tools in cataract surgery, overcoming the image distortion caused by gas-liquid interface refraction and viewpoint scaling. The results showed that the average accuracy of this method in surgical images was 0.75.⁷ Basit et al. proposed a deep learning method based on SSD, combined with the lightweight neural network architecture MobileNet v2, to address the challenge of improving accuracy in brain tumor detection. Even in low resolution MRI scans, this method could detect the smallest tumor spots. The results showed that the accuracy of the model reached 98%.⁸ Yang et al. proposed a new object detection framework to address the shortcomings of the SSD model in detecting small objects. The backbone network of the original model was replaced. A partially convolutional network was adopted to reduce computational complexity. The feature extraction capability was enhanced through a multi-scale fusion network. The results showed that the model achieved an average accuracy of 80.38%, with a complexity only half that of SSD.⁹ Zhang et al. proposed a flight trajectory tracking method based on video signal filtering to address the issue of motion tracking in basketball games. Specifically, the adaptive median filtering algorithm is used to filter the basketball flight video signal. After median filtering, select the image difference to enhance the basketball flight trajectory image, and then use Harris corner detection algorithm to enhance the image. The experimental results have demonstrated the effectiveness of this method.¹⁰

In the field of basketball detection and recognition, various methods have been proposed to improve detection accuracy and action recognition ability. Ye et al. proposed a comprehensive action recognition system for tennis serve training based on inertial sensors, aiming to assist in tennis serve training by collecting athlete action data using multiple inertial sensors. The system adopts small dataset data preprocessing and convolutional neural network method based on local attention mechanism for recognition. The experimental results show that this method is effective.¹¹ Pengyu et al. proposed a fast skeleton extraction and model segmentation method to address the accuracy issue of target recognition in basketball player detection technology. Starting from the characteristic information of basketball target states, this study compared and analyzed various detection methods. A new strategy for constructing dynamic ellipsoid bounding boxes was proposed to replace traditional bounding box methods.¹² Li et al. proposed a hybrid system for recognizing basketball players’ movements under complex backgrounds and inconsistent lighting conditions. The enhanced You Only Look Once (YOLO) was used for player detection, combined with long short-term memory networks and fuzzy logic for action classification. The results showed that the system achieved a recognition accuracy of 99.3% for 8 types of actions on different basketball videos.^13,14 Facchinetti et al. proposed an algorithm based on athlete tracking data to automatically identify active periods in basketball games, addressing the issue of alternating between active and inactive periods. The data from GPS sensors was utilized and the performance was evaluated through ROC. The algorithm accurately identified the active period of the game by setting thresholds applicable to player motion parameters.¹⁵ Liu et al. proposed a universal jersey number detector to address the challenges of player location and identity recognition. The detector predicted the bounding boxes, keypoints, and jersey number categories of players through an end-to-end model, rather than generating numerical suggestions based on preset anchor points. The results showed that the model had good detection performance.^16,17

In summary, traditional methods perform poorly in small object detection and temporal motion capture, resulting in low recognition accuracy in scenarios such as rapid movement, complex backgrounds, and frequent occlusions during professional basketball training action recognition. In view of this, this study designs a Multi-scale Feature Enhanced SSD (MSA-SSD) that combines multi-scale feature integration with dual attention mechanism, and an action recognition model based on Dual scale 3D Convolutional Network (DS-3DNet). The innovation of the research lies in achieving multi-scale feature fusion and global contextual information enhancement through the Pyramid Feature Integration Module (PFIM) and the Global Information Integration Module (GIIM). At the same time, the DS-3DNet architecture combines adaptive weight fusion and attention mechanism to enhance the ability of athletes to extract spatial details and temporal features of their movements. The research aims to solve the technical difficulties of action recognition in complex basketball game scenarios, and provide more efficient solutions for sports performance analysis and intelligent assisted training in practical applications.

Methods and materials

Design of object detection algorithm based on improved SSD

The accuracy and real-time performance of action recognition technology are crucial for sports performance analysis and intelligent assisted training systems in professional basketball games and training scenarios.¹⁸ SSD detects targets simultaneously on feature maps of different scales, avoiding the process of generating candidate boxes using the Two-stage algorithm and significantly improving detection efficiency.^19,20 However, considering that in professional basketball games, due to the varied movements of athletes and frequent target occlusion, it will further exacerbate missed and false detections in the SSD model. Therefore, in order to improve its detection performance in the complex environment of professional basketball games, the research attempts to introduce PFIM and GIIM to enhance its shortcomings in occlusion and small object detection, and provide more accurate object detection results for subsequent action recognition.

Firstly, in the PFIM module, traditional convolutional networks struggle to fully capture multi-scale semantic features. Therefore, a Multi-path Parallel Residual Module (MPRM) is designed, as shown in Figure 1.

Figure 1.

Residual unit structure and MPRM structure. (a) Residual unit structure, (b) MPRM structure.

Figure 1(a) shows the basic residual unit structure diagram, which includes 1 × 1 convolutional layer, 3 × 3 convolutional layer, and ReLU activation function. The gradient vanishing problem in deep networks is alleviated through skip connections. Figure 1(b) shows the structure diagram of the MPRM, which has multiple convolution kernels of different sizes such as 3 × 3, 5 × 5, and 7 × 7 for parallel feature extraction, in order to increase the receptive field and multi-scale feature fusion ability. Compared to the basic residual unit, the MPRM module can capture richer contextual information and detailed features, making it more suitable for small object detection tasks in complex backgrounds. $X_{1}$ to $X_{4}$ represent the features of different processing stages within the residual block, including input, channel adjusted features, spatially filtered features, and linear projection results before adding to the input.

Subsequently, in order to perform more accurate upsampling and improve the image’s detail preservation ability, Sub-pixel Convolution (SPC) is used to enhance the resolution of basketball game images. Compared to upsampling and deconvolution, SPC not only improves computational efficiency, but also enhances image detail representation. Therefore, the final PFIM module structure is shown in Figure 2.

Figure 2.

PFIM module structure.

As shown in Figure 2, in the PFIM module, the input feature map undergoes convolutional and compression operations to extract preliminary multi-scale features, and preserves shallow positional information and deep semantic information through skip connections. Subsequently, the module uses multiple convolution kernels of different sizes to process the features, extracting rich multi-scale features. The feature maps are gradually fused after downsampling to improve detection performance in small targets and complex backgrounds.

Subsequently, GIIM aims to supplement the global contextual information that is not captured in the PFIM module. Its core is the Dual-axis Convolutional Perception Module (DCPM), which enhances the depth of input features through spatial and channel dual convolution attention mechanisms. DCPM performs dual attention enhancement on the input feature map. By jointly modeling the spatial and channel dimensions, significant regions in the feature map are precisely focused to enhance the fusion effect of contextual information. By calculating Global Average Pooling (GAP) and Global Max Pooling (GMP), spatial and channel features are obtained. The convolutional layer is used to fuse spatial information. Spatial weight matrices $W_{s 1}$ and $W_{s 2}$ are obtained, as shown in equation (1).

{\begin{cases} W_{s 1} = σ (C o n v_{1 \times 1} (G A P (U_{1}) + G M P (U_{1}))) \\ W_{s 2} = σ (C o n v_{1 \times 1} (G A P (U_{2}) + G M P (U_{2}))) \end{cases}

(1)

In equation (1), $σ$ is the Sigmoid activation function. $C o n v_{1 \times 1}$ is a 1 × 1 convolution operation. Subsequently, global feature aggregation is performed on $U_{1}$ and $U_{2}$ in the channel dimension. The channel attention weight matrices $W_{c 1}$ and $W_{c 2}$ are calculated using MLP, as shown in equation (2).

{\begin{cases} W_{c 1} = σ (M L P (G A P (U_{1}) ‖ G M P (U_{1}))) \\ W_{c 2} = σ (M L P (G A P (U_{2}) ‖ G M P (U_{2}))) \end{cases}

(2)

In equation (2), $M L P$ is a multi-layer perceptron structure. Specifically, it consists of two fully connected layers: the first layer reduces the channel dimension by a factor of 16 and applies a ReLU activation, and the second layer restores the original channel dimension followed by a Sigmoid activation to obtain the attention weights. Finally, the spatial and channel attention weights are fused and reweighted to obtain the enhanced output feature maps $U_{1}^{'}$ , and $U_{2}^{'}$ , as shown in equation (3).

{\begin{cases} U_{1}^{'} = W_{s 1} \otimes W_{c 1} \otimes U_{1} \\ U_{2}^{'} = W_{s 2} \otimes W_{c 2} \otimes U_{2} \end{cases}

(3)

In equation (3), $\otimes$ represents the element wise multiplication operation. The use of element wise multiplication instead of addition or concatenation is because element wise multiplication can effectively maintain the relative proportional relationship between the features of each part, thereby highlighting important features and suppressing unimportant features. In contrast, addition can lead to the merging of feature subspaces, which may result in the loss of important information, while concatenation increases dimensionality, which may increase computational complexity and overfitting risk. Finally, $U_{1}^{'}$ and $U_{2}^{'}$ are further processed by the MPRM module and SPC module, and output as $U_{1}^{″}$ through skip connections, achieving the integration of global contextual information and enhancing feature expression. Combining PFIM and GIIM modules, the improved MSA-SSD object detection model structure is shown in Figure 3.

Figure 3.

MSA-SSD module structure. (a) 2D convolution operation, (b) 3D convolution operation.

As shown in Figure 3, the overall structure of the MSA-SSD model includes a basic feature extraction network, PFIM, and GIIM, through which multi-scale feature fusion and contextual information enhancement are achieved. PFIM receives feature maps of different scales as input, extracts rich semantic features through multi-path parallel convolution, and generates fused output feature maps; GIIM further processes these features, enhances the expressive power of feature maps through spatial and channel attention mechanisms, and finally outputs enhanced contextual information. These enhanced feature maps are then used in the final stage of object detection to achieve high-precision object recognition and localization.

Construction of training action recognition model based on 3D convolution

After completing the target detection part of basketball game training, the extracted athlete action areas need further temporal feature analysis to accurately identify complex action sequences. In professional basketball games and training scenarios, athlete action recognition not only needs to focus on spatial features, but also needs to capture temporal variation information of consecutive frames in the video.²¹ 3D Convolutional Neural Network (3D-CNN) is used to process spatial and temporal features in video data, especially for capturing detailed changes in athlete movements and recognizing training movements. Its structure is shown in Figure 4.^22,23

Figure 4.

2D and 3D convolution operations.

In the 2D convolution of Figure 4(a), the convolution kernel slides along the height and width dimensions of the input feature map to extract local spatial features of the image through spatial convolution, which is suitable for single frame image processing. In the 3D convolution of Figure 4(b), the convolution kernel slides in the three dimensions of height, width, and time, which can simultaneously capture spatial and temporal variation features in video data.^24,25 However, considering the significant computational and memory consumption issues of standard 3D convolutional networks when processing high-resolution videos, a dual scale 3D-CNN architecture is designed. By combining high and low resolution branches, a feature fusion strategy is used to capture fine-grained spatial details and global temporal information of athlete movements, in order to better capture fine-grained spatial details and global temporal information and reduce computational costs.

Firstly, to address the potential issues of gradient vanishing and exploding caused by random initialization, the study adopts a mixed initialization strategy of Xavier initialization and He initialization to ensure a reasonable weight distribution in each layer of the network. In shallow convolutional networks, due to the relatively uniform number of input and output nodes, using Xavier initialization can maintain the variance stability of weights, as shown in equation (4).

W \sim u (- \sqrt{\frac{6}{n_{i n} + n_{o u t}}}, \sqrt{\frac{6}{n_{i n} + n_{o u t}}})

(4)

In equation (4), $W$ is the initialized weight matrix. $n_{i n}$ and $n_{i n}$ are the number of nodes in the input layer and output layer, respectively. $u$ represents a uniform distribution within a specified interval. In deep networks, He initialization can further enhance the nonlinear expression ability of the model by increasing the variance of weights, ensuring the gradient stability of the deep network, as shown in equation (5).

W \sim N (0, \frac{2}{n_{i n}})

(5)

In equation (5), $N (0, \frac{2}{n_{i n}})$ represents a normal distribution with a mean of 0 and a variance of $σ^{2}$ . The calculation method of $\frac{2}{n_{i n}}$ for weight variance can effectively alleviate the gradient vanishing in deep convolutional networks. Subsequently, in the feature fusion stage, weight addition is used to weight and overlap the feature vectors extracted from the high and low resolution branches, resulting in a corresponding 1024 dimensional feature vector, as shown in equation (6).

Y = X_{i, j, d}^{a} + X_{i, j, d}^{b}

(6)

In equation (6), $Y$ is the feature representation obtained by adding weights. $X_{i, j, d}^{a}$ and $X_{i, j, d}^{b}$ represent the feature map elements of the high and low resolution branches, respectively. $i$ , $j$ , and $d$ respectively, represent the width, height, position, and channel index of the feature map. Equation (6) is the initial step of feature fusion, which achieves preliminary integration of high and low resolution branches by directly adding ( $X^{a} + X^{b}$ ). In addition, to fully utilize the complementary advantages of high and low resolution features, adaptive weight coefficients are introduced to dynamically adjust weights based on the saliency of the feature map. The GAP is used to dynamically determine the adaptive weight coefficients, as shown in equation (7).

{\begin{cases} α = G A P (X^{a}) \\ β = G A P (X^{b}) \end{cases}

(7)

In equation (7), $X^{a}$ and $X^{b}$ represent high and low resolution branches, respectively. $α$ and $β$ are the adaptive weight coefficients for high and low resolution feature maps, respectively. Subsequently, the feature maps of the two branches are weighted and fused using adaptive weights, as shown in equation (8).

Y^{'} = α \cdot X_{i, j, d}^{a} + β \cdot X_{i, j, d}^{b}

(8)

In equation (8), $Y^{'}$ is the fused feature map. Adaptive weight fusion dynamically generates weight coefficients through global average pooling, and its advantage lies in the ability to adjust weights based on the saliency of input features. To further enhance the significant feature expression after fusion, suppress background noise and unimportant feature regions, spatial and channel attention mechanisms are applied on $Y^{'}$ , the attention weight matrix is calculated, and the feature map is re-weighted, as shown in equation (9).

Y^{″} = A t t (Y^{'}) \otimes Y^{'}

(9)

In equation (9), $Y^{″}$ is the final reweighted feature vector. $A t t (Y^{'})$ is the attention weight matrix calculated through channel attention and spatial attention. Channel attention can learn the importance of different channels, while spatial attention can focus on significant spatial regions. The final output suppresses noise and enhances task related features through attention weighting. Next, to further improve the accuracy of training action recognition, Support Vector Machine (SVM) is selected as the classifier. $Y^{″}$ is input to SVM for training action recognition. DS-3DNet uses SVM for classification after extracting multidimensional features, in order to utilize SVM’s powerful classification ability in small sample learning and high-dimensional feature space, thereby improving the accuracy of action recognition. In contrast, using softmax output for end-to-end training may lead to overfitting of the model to the training data, especially in complex action scenarios, where SVM’s advantages in optimizing classification boundaries cannot be fully utilized. And its working principle is shown in Figure 5.

Figure 5.

Schematic diagram of SVM.

As shown in Figure 5, the input features may be nonlinear and inseparable in the original space. SVM maps these features to a high-dimensional feature space through kernel function $ϕ$ . In high-dimensional space, SVM searches for the optimal classification hyperplane that can maximize the interval, classifying complex nonlinear data. Therefore, based on the above improvements, the final DS-3DNet basketball training action recognition model structure is shown in Figure 6.

Figure 6.

DS-3DNet module structure.

As shown in Figure 6, the DS-3DNet training action recognition model first uses two types of input feature maps, high-resolution, and low resolution, which are 112 × 112 and 64 × 64, respectively, and input them to two parallel 3D convolutional neural network branches. The high-resolution branch is used to extract fine-grained spatial features of athlete movements in video clips and capture small changes in movements. The low-resolution branch is used to capture global temporal features, adapting to the modeling requirements of long-term action sequences while reducing computational costs. Secondly, the feature maps extracted by the two branches are fused through the feature representation module, integrating spatial details and temporal information. Subsequently, the fused feature vectors are input into SVM for action classification, and the final decision result is output.

Methodology

Based on the above theory, the MSA-SSD model is proposed to improve the recognition and detection accuracy of athlete movements in professional basketball games. The entire process of the proposed method is shown in Figure 7.

Figure 7.

Flow chart of basketball game training action recognition technology.

As shown in Figure 7, this process achieves efficient object detection by gradually passing the input video frames to different processing modules. Firstly, the video frames are processed through a feature extraction network to generate preliminary feature maps, which are then passed on to PFIM. In this module, multi-scale semantic information is extracted through multi-path parallel convolution to enhance feature expression. The fused features are input into GIIM, and spatial and channel attention mechanisms are used to further optimize the features. Finally, the most representative detection boxes are selected through NMS, and the final detection result containing category labels and bounding box coordinates is output.

Results

Performance testing of MSA-SSD object detection model

The experiment is conducted in the Python 3.8 environment and the PyTorch deep learning framework is used for model training and testing. The experimental hardware configuration includes a workstation equipped with an Intel Xeon W-2295 processor, NVIDIA RTX 3090 GPU 24 GB of graphics memory, and 128 GB of memory. During the model training process, random weights are initialized and gradient updates are performed using the Adam optimizer. The initial learning rate is 0.001 and the batch size is 32. The dataset selected is SportsMOT, which covers three sports scenes: basketball, volleyball, and soccer. It contains 240 video sequences, over 150,000 frames of images, and over 1.6 million bounding boxes. The video resolution is 720P and the frame rate is 25FPS. The detection time is measured per frame to evaluate its performance in real-time applications. GPU acceleration is utilized in the model training and inference process to ensure real-time object detection in complex scenes. During SSD training, the training dataset used is 240 video sequences, containing over 150,000 frames of images and 1.6 million bounding boxes. In order to improve the robustness and accuracy of the model, a data augmentation strategy including random cropping, horizontal flipping, brightness and contrast adjustment was adopted, aiming to simulate various transformations that may be encountered in real environments. In the MSA-SSD model, key hyperparameters include using a learning rate of 0.001, batch size of 32, and training with 500 iterations, while introducing a dropout rate of 0.5 to prevent overfitting. For the DS-3DNet model, a learning rate plan combining Xavier and He initialization is adopted, and convolutional layers are stacked with a depth of 6 layers, while feature fusion is achieved through dynamic weight adjustment and fully connected layers. The training time of DS-3DNet is about 40 h, with a total of 500 iterations. The dataset is divided into training set (70%), validation set (15%), and test set (15%) to ensure uniform distribution of samples in different scenes and lighting conditions. Action tags include “dribbling,” “passing,” “shooting,” and “dunking,” each containing at least 200 samples to provide a reliable training foundation. The annotation process uses professional personnel to annotate the collected 1500 video clips frame by frame, recording the start and end times of each action, and assigning corresponding bounding boxes and action labels to each athlete to ensure the accuracy and consistency of the annotation. During the training process, PFIM and GIIM will be jointly trained with the SSD model to ensure effective collaboration between feature extraction and contextual information enhancement. This joint training strategy can improve the overall performance of the model in complex environments, while avoiding feature inconsistency issues that may arise from independent training of certain modules. Firstly, to verify the impact of each module on the final model, the results of ablation testing are shown in Table 1.

Table 1.

Ablation test results.

Experimental group	Model configuration	mAP (%)	mAP p-value	FPS	FPS p-value
Baseline	SSD	87.9	/	39.7	/
Single component	SSD + PFIM	90.7	0.003**	38.1	0.12
Single component	SSD + GIIM	89.9	0.008**	37.6	0.04*
Combining model	MSA-SSD	93.3	<0.001***	36.3	0.002**

*:p < 0.05, **:p < 0.01, ***:p < 0.001.

In Table 1, the indicators include Mean Average Precision (mAP) and Frame Per Second (FPS). The mAPs of the basic SSD, SSD + PFIM, SSD + GIIM, and MSA-SSD models were 87.9%, 90.7%, 89.9%, and 93.3%, respectively. The FPS of the basic SSD, SSD + PFIM, SSD + GIIM, and MSA-SSD models were 39.7, 38.1, 37.6, and 36.3, respectively. The improved model incorporating PFIM and GIIM modules showed significant improvement in accuracy throughout the entire training process. The PFIM module enhances the detection ability of small targets and complex backgrounds through multi-scale feature fusion, while the GIIM module improves the model’s feature expression ability by capturing global contextual information. The final MSA-SSD model combines these two improvements, demonstrating the highest detection accuracy. The statistical test results show that the single component models SSD + PFIM and SSD + GIIM have significant improvements in mAP, while the combined model MSA-SSD has a more significant improvement in mAP, and also shows a statistically significant decrease in FPS. Subsequently, SSD, Convolutional Encoder Network (CENet), and Adaptive Convolutional Neural Network (ACNet) are selected as comparison models. To further validate the detection capability of the model in actual basketball game scenarios, especially in complex situations where the target is partially occluded, a qualitative analysis experiment is designed. By displaying the detection results of people and basketballs under different degrees of occlusion, the robustness performance of the model in fine-grained object detection and complex scenes is shown in Figure 8.

Figure 8.

Target detection qualitative analysis results. (a) SSD, (b) CENet, (c) ACNet, (d) MSA-SSD.

In Figure 8(a), SSD exhibited missed detection and false detection in all images, especially in Image 2 and Image 4, where the detection performance for occluded targets was poor. In Figure 8(b), CENet improved the detection ability in complex backgrounds, but its accuracy in detecting small targets such as basketballs was insufficient. In Figure 8(c), ACNet significantly improved the small object detection performance by introducing an adaptive convolution strategy, but there was missed detection in multiple occlusions. In Figure 8(d), MSA-SSD performed the best, accurately detecting players and basketballs in all test images, and maintaining high detection accuracy even in complex occlusion scenes. Its advantage lies in the introduced PFIM and GIIM, which enhances the fusion ability of multi-scale features, and enhances adaptability to complex scenes through global contextual information capture, effectively reducing missed and false detection rates. In action recognition, error detection is defined as the model incorrectly recognizing or classifying an action, such as mistaking an athlete’s “passing” action for “dribbling” or not detecting the action at all. Finally, the detection performance test results of each model under different environmental conditions are shown in Table 2.

Table 2.

Comprehensive evaluation of multiple environments and conditions.

Environment	Model configuration	Average detection accuracy/%	Missed rate/%	False detection rate/%	Detection time/ms	Relative improvement of MSA-SSD (%)
Low-light environment	SSD	70.3	15.7	14.6	28.5	17.69
	CENet	73.8	12.9	12.4	32.4	11.62
	ACNet	76.7	10.3	10.5	35.1	7.39
	MSA-SSD	82.4	7.3	8.7	30.8	/
	YOLOv8	79.2	9.6	9.1	29.8	6.46
	TimeSformer	81.3	8.1	9.3	32.3	1.35
Strong light interference	SSD	68.1	18.7	16.8	28.8	178.12
	CENet	70.5	16.4	14.7	33.2	13.07
	ACNet	74.6	12.6	11.9	35.5	7.10
	MSA-SSD	80.2	8.6	9.4	31.2	/
	YOLOv8	77.6	10.3	10.2	30.1	3.14
	TimeSformer	78.7	9.2	10.1	31.6	1.87
Complex background	SSD	65.9	20.3	18.7	29.1	19.69
	CENet	68.7	17.4	16.2	32.8	14.79
	ACNet	73.5	14.2	13.1	34.7	7.36
	MSA-SSD	78.9	10.4	10.8	30.5	/
	YOLOv8	76.4	11.6	11.3	30.3	3.27
	TimeSformer	77.3	10.7	11.1	31.4	2.66
Severe occlusion	SSD	63.7	22.6	20.9	28.9	22.32
	CENet	66.9	19.3	18.1	33.3	16.45
	ACNet	71.4	15.6	14.3	35.9	8.51
	MSA-SSD	77.8	11.1	11.6	31.1	/
	YOLOv8	75.3	12.2	12.4	30.3	2.61
	TimeSformer	76.2	11.1	12.6	30.9	2.43

Note. Mean Average Precision (mAP): The average detection accuracy of the model over the entire test set. Missed Rate (MR): The ratio of the number of missed targets to all targets. False Detection Rate (FDR): The ratio of the number of false detection targets to all detected targets. Detection Time (DT): The time required by the model to process one frame of image, measured in milliseconds. Top-1 Accuracy: The proportion of the previous option that the model can accurately predict. Top-5 Accuracy: The proportion of true categories included in the first five options of the model output. Temporal Consistency Score (TCS): A quantitative indicator of the consistency of results output at different time points. Recognition Jitter Rate (RJR): An indicator of output instability.

As shown in the Table 2, under low light conditions, the average detection accuracy of MSA-SSD reached 82.4%, with a missed detection rate of only 7.3%. This performance is attributed to the PFIM module’s ability to enhance detailed features under low light conditions, thereby improving detection accuracy. In strong light interference and complex background environments, the false detection rates of MSA-SSD were 9.4% and 10.4%, respectively. The main reason is that the GIIM module can capture global contextual information and effectively distinguish between targets and backgrounds. In addition, MSA-SSD exhibited strong robustness in severely occluded environments, with a missed detection rate of 11.1%, which was 4.5% lower than ACNet. Although the detection time is slightly higher than SSD, its improved accuracy and robustness make it more suitable for real-time object detection tasks in complex basketball game scenes.

Application analysis of DS-3DNet training action recognition model

The experimental environment is the same as the previous section, using NVIDIA A100 GPU for model training and testing. The comparative models are Inflated 3D ConvNet (I3D), Temporal Relationship Network (TRN), and Slow Fast Network (SFNet). The study collects live videos of 20 professional basketball games, including central field perspective, sideline perspective, and audience perspective. There are a total of 1500 video clips with an average duration of about 20 s, covering various movements of athletes during the competition, including different lighting conditions, occlusion situations, and background interference. The video segments in the dataset are sliced, key frames are extracted, and standardized to a uniform resolution of 112 × 112, with a frame rate of 30FPS maintained. All video clips are professionally annotated, including detailed action categories, time intervals, and athlete target box information, to ensure the accuracy and consistency of data annotation. Firstly, the results of Top-1 and Top-5 accuracy tests on each model are shown in Figure 9.

Figure 9.

Top-1 and Top-5 test results. (a) Top-1 accuracy test, (b) Top-5 accuracy test.

Figures 9(a) and (b) show the Top-1 and Top-5 results of each model, respectively. In Figure 9(a), when the number of iterations was 500, the Top-1 accuracy of I3D, TRN, SFNet, and DS-3DNet was 91.1%, 89.3%, 93.2%, and 94.5%, respectively. In Figure 9(b), when the number of iterations was 500, the Top-5 accuracy of each model was 94.2%, 96.4%, 96.9%, and 98.2%, respectively. The advantages of DS-3DNet are mainly attributed to the optimization of feature fusion strategy and temporal modeling module, which effectively enhances the robustness of the model to uncertain action categories. In addition, although SFNet and TRN have certain advantages in temporal modeling, their feature extraction capabilities are insufficient, resulting in slightly inferior overall performance compared to DS-3DNet. The Top-1 accuracy of 94.5% was obtained under specific training and testing conditions, so its accuracy may decrease in unseen teams or different camera angles. Subsequently, to test the stable performance of the model in long-term videos, 200 video clips with a duration of more than 20 s are collected from the dataset for testing. The results are shown in Figure 10.

Figure 10.

TCS and RJR test results. (a) TCS test, (b) RJR test.

Figures 10(a) and (b) show the Temporal Consistency Score (TCS) and Recognition Jitter Rate (RJR) test results of the model at different video lengths, respectively. In Figure 10(a), when the video duration was 50 s, the TCS of I3D, TRN, SFNet, and DS-3DNet was 90.1%, 87.3%, 91.2%, and 93.8%, respectively. In Figure 10(b), when the video duration was 50 s, the RJR of each model was 11.5%, 11.9%, 10.3%, and 7.9%, respectively. The high-low dual resolution architecture and enhanced temporal modeling capability of DS-3DNet enable it to more accurately capture motion change features in long temporal videos, while effectively suppressing temporal noise and unstable detection results. Finally, in order to comprehensively evaluate the robustness of the model in action recognition under different complex backgrounds, especially in interference environments and rapidly changing action scenes, detection experiments are designed under various background interference conditions. The recognition accuracy and robustness scores of models in simple, medium, and complex backgrounds are compared in Table 3. The robustness score represents the average per-class recall across all tested action categories under each background condition. This metric reflects the model’s consistency in correctly identifying various actions in challenging environments.

Table 3.

Basketball action detection results of the model under different backgrounds.

Basketball action	Model	Simple background	Medium background	Complex background	Robustness score /%
Dribbling	I3D	89.1	85.7	82.3	85.6
	TRN	90.3	87.1	83.8	86.4
	SFNet	91.8	88.5	85.4	87.8
	DS-3DNet	92.8	89.6	86.4	88.5
Passing	I3D	87.5	84.2	80.1	83.9
	TRN	88.7	85.8	81.7	85.4
	SFNet	90.1	86.9	83.5	86.8
	DS-3DNet	91.5	87.8	84.2	86.7
Shooting	I3D	91.2	87.9	84.5	87.9
	TRN	92.4	89.2	85.9	88.7
	SFNet	93.6	90.5	87.5	89.8
	DS-3DNet	94.2	91.1	88.1	90.8
Dunking	I3D	92.2	89.4	85.3	88.8
	TRN	93.1	90.2	86.5	89.7
	SFNet	94.5	91.8	87.8	91.4
	DS-3DNet	95.6	92.7	89.3	91.9

As shown in Table 3, DS-3DNet outperformed other models in terms of detection accuracy and robustness scores under all background conditions, especially in complex backgrounds. The dual scale architecture of DS-3DNet significantly enhanced the model’s adaptability to background noise and accuracy in capturing subtle motion features. Specifically, the robustness score of DS-3DNet in complex backgrounds reached 91.9%, which was about 3%–5% higher than models such as I3D and TRN. In contrast, models such as I3D and TRN had poor adaptability to complex backgrounds and were easily affected by background noise, resulting in a significant decrease in detection accuracy and robustness scores. This indicates that DS-3DNet has stronger adaptability to complex scenarios in practical applications, which is suitable for high demand action recognition tasks.

Conclusion

In response to the low accuracy and poor real-time performance of action recognition in professional basketball games and training scenarios, the MSA-SSD object detection model and DS-3DNet action recognition model were proposed. The performance test results of the MSA-SSD model showed that when the number of iterations was 500, the mAP and FPS of the MSA-SSD model were 93.3% and 36.3, respectively. In qualitative testing, players and basketballs that could detect 4 images without missed or false detections were superior to the comparison model. Finally, its average detection accuracy under low light conditions was 82.4%, with a missed detection rate of 7.3% and a false detection rate of 8.7%. The detection time was 30.8 ms, achieving a balance between high accuracy and computational efficiency. In the DS-3DNet application test, when the number of iterations was 500, its Top-1 and Top-5 accuracy was 94.5% and 98.2%, respectively. On a video detection with a duration of 50 s, its TCS and RJR were 93.8% and 7.9%, respectively. The highest action recognition accuracy in a simple background was 95.6%, with a robustness score of 92.7. The highest action recognition accuracy in complex backgrounds was 89.3%, with a robustness score of 91.9. The results showed that the MSA-SSD and DS-3DNet models exhibited excellent detection and action recognition performance in complex basketball game videos, with excellent recognition accuracy and robustness.

However, there are still shortcomings in the research, namely, there is still room for improvement in detection time and computational complexity. Future work will focus on exploring lightweight network structures and multi-modal fusion strategies to further enhance the model’s generalization ability and computational efficiency, and explore the application effects of the model in other types of motion. In addition, the real-time deployment of the model may be affected by computational resource limitations, especially when running on low-end devices. Future research can explore extending this model to other team sports fields and consider making the relevant code and supplementary materials public to promote transparency and reproducibility of the study. The generalization ability of the proposed model can be further explored by testing it on out-of-distribution data, including unseen teams and different camera angles.

Footnotes

ORCID iD

Rui Tian

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

All data generated or analyzed during this study are included in this article.*

References

Mokayed

Quan

Alkhaled

, et al. Real-time human detection and counting system using deep learning computer vision techniques. Artif Intell Appl 2023; 1(4): 221–229.

Antali

Havas

Hogan

, et al. Nonlinear dynamics of a basketball rolling around the rim. Nonlinear Dyn 2021; 104(4): 3013–3037.

Jiang

Zhang

. Deep learning algorithm based wearable device for basketball stance recognition in basketball. Int J Adv Comput Sci Appl 2023; 14(3): 25–33.

Barrons

Esposito

Stefanyshyn

, et al. The traction requirements of female and male basketball players. Footwear Sci 2023; 15(1): 28–34.

Wang

Shoulin

Alyami

, et al. A novel deep learning‐based single shot multibox detector model for object detection in optical remote sensing images. Geosci Data J 2024; 11(3): 237–251.

Wang

Liu

Guo

. An object detection algorithm based on the feature pyramid network and single shot multibox detector. Clust Comput 2022; 25(5): 3313–3324.

Sogabe

Ito

Miyazaki

, et al. Detection of instruments inserted into eye in cataract surgery using single-shot multibox detector. Sensor Mater 2022; 34(1): 47–54.

Basit

Ullah

Noor

, et al. Brain tumor detection in MRI scans using single shot multibox detector. J Intell Fuzzy Syst 2022; 43(2): 1985–1993.

Yang

Huang

Tan

, et al. A small object detection method based on SSD model. Signal Image Video Process 2024; 18(1): 173–180.

10.

Zhang

Alshawi

, et al. Basketball flight trajectory tracking using video signal filtering. Mob Netw Appl 2023; 28(5): 1699–1711.

11.

Zhu

, et al. Comprehensive tennis serve training system based on local attention-based CNN model. IEEE Sens J 2024; 24(7): 11917–11926.

12.

Pengyu

Wanna

. Image detection and basketball training performance simulation based on improved machine learning. J Intell Fuzzy Syst 2021; 40(2): 2493–2504.

13.

Luo

Islam

. Tracking and detection of basketball movements using multi-feature data fusion and hybrid YOLO-T2LSTM network. Soft Comput 2024; 28(2): 1653–1667.

14.

Nanda

. An integrated campus radio frequency identification system on clouds analysis for improved security. Acta electron Malays 2022; 6(1): 4–6.

15.

Facchinetti

Metulini

Zuccolotto

. Filtering active moments in basketball games using data from players tracking systems. Ann Oper Res 2023; 325(1): 521–538.

16.

Liu

Bhanu

. JEDE: universal Jersey number detector for sports. IEEE Trans Circuits Syst Video Technol 2022; 32(11): 7894–7909.

17.

Shakrani

Kanyangarara

Parowa

, et al. A deep learning model for face recognition in presence of mask. Acta inform Malays 2022; 6(2): 43–46.

18.

Liu

Gao

Chi

, et al. Pedestrian detection algorithm based on improved SSD. Int J Comput Appl Technol 2021; 65(1): 25–35.

19.

Sharma

Jaffery

. Categorical vehicle classification and tracking using deep neural networks. Int J Adv Comput Sci Appl 2021; 12(9): 564–574.

20.

Mary

Koshy

. Detection and classification of vehicles using audio visual cues. Multimed Tool Appl 2023; 82(28): 44087–44106.

21.

Keceli

Kaya

. Violent activity classification with transferred deep features and 3d-Cnn. Signal Image Video Process 2023; 17(1): 139–146.

22.

Ur Rehman

Ahmed

Attique Khan

, et al. Dynamic hand gesture recognition using 3D-CNN and LSTM networks. Comput Mater Continua (CMC) 2021; 70(3): 4675–4690.

23.

Sadhin

Rayhan

. A review of convolutional neural network in emerging trends and opportunities in precision agriculture. Acta inform Malays 2023; 7(2): 67–73.

24.

Yudistira

Kavitha

Kurita

. Weakly-supervised action localization, and action recognition using global–local attention of 3D CNN. Int J Comput Vis 2022; 130(10): 2349–2363.

25.

Sadhin

. Real-time tomato leaf disease classification using convolutional neural network. Acta inform Malays 2023; 7(1): 29–32.