Visual model construction of player composition data for competitive basketball teams based on SRR-voting

Abstract

With the booming development of sports events, the demand for athlete performance evaluation has begun to increase. The visualization analysis of basketball player data is significant for the analysis of individual value and potential of athletes. However, existing data analysis methods have problems such as insufficient efficiency, low accuracy, and insufficient attention to the inherent attributes of sports events. Therefore, to better analyze the composition data of basketball team members, the study combines Synthetic Minority Over-Sampling Technique (SMOTE) and Random under-sampling techniques to handle the imbalance in the athlete dataset, and then uses Random Forest (RF) to extract data features. Based on this data processing, a ensemble learning method based on SMOTE, Random under-sampling, and RF (SRR)-Voting is proposed to predict athlete performance. The results demonstrated that in the player dataset analysis, the Precision, Recall, and F1 of the research method were 0.9486, 0.9588, and 0.9752, which were superior to comparative methods. The proposed method had a ROC value of 0.94 and a fitting value of 0.17, indicating better fitting performance. This indicates that the designed player composition data visualization analysis method based on ensemble learning can effectively analyze the various attribute data of competitive players and ordinary players, providing effective data support for the comprehensive quality analysis of competitive players.

Keywords

SMOTE Under-sampling random forest ensemble learning basketball players visualization

Introduction

Basketball, as a team sport, is widely popular both domestically and internationally. Recently, data mining approaches related to basketball have been continuously proposed to analyze and predict the trajectories of various movements in basketball, thereby achieving effective action judgment, tactical guidance, etc.^1,2 Based on this background, the analysis of the competitive ability of basketball players is particularly important. Athlete ability is a prerequisite for achieving various movements and tactics. Meanwhile, with the integration of various commercial factors in basketball games, game gambling has gained widespread support.^3,4 However, fans usually set odds based on their subjective wishes, lacking corresponding scientific basis. This also requires predicting the overall quality and ability of basketball players. In addition, the artificial intelligence technology has provided effective support for the analysis of various data in basketball games.^5,6 Affected by science and technology, especially the application of data mining techniques in sports prediction, accurate and effective prediction of comprehensive qualities for basketball players is gradually becoming possible. However, existing data processing methods such as text, video, and statistics cannot deeply explore the potential association features among players.^7,8 The data visualization method based on artificial intelligence technology can extract effective data features from various large and complex data sets, and demonstrate data change patterns more concisely and intuitively. Among them, ensemble learning models overcome the shortcomings of a single model in data feature extraction and result classification by combining multiple classifiers, thereby improving the classification and regression performance of the model. Therefore, the study integrates Synthetic Minority Over-Sampling Technique (SMOTE) and Random Under-sampling to handle the imbalance in the athlete dataset, and then uses RF to extract data features. A visualization model of basketball team player composition data based on SMOTE, Random Under-sampling, and RF (SRR)-Voting is constructed.

The explainability of machine learning algorithms directly affects the actual effectiveness of visualization. By optimizing data features and combining them with appropriate interpretation methods, the application value of machine learning algorithms can be significantly improved. Moreover, with the development of explainable artificial intelligence technology, machine learning can be better integrated with practical scenarios such as basketball tactical analysis, promoting the intelligent development of related fields. In machine learning algorithms, ensemble learning models have good data processing and adaptability, which have been applied and developed to a certain extent in various fields such as geography and healthcare. Reference 9 analyzed the corresponding features of different lithofacies using ensemble learning prediction method. This ensemble learning combined continuous constrained Boltzmann machines, Bayesian optimization, and optical gradient boosting. The results showed that the newly proposed predictor was an efficient and robust lithology prediction solver, which was worthy of widespread application in geological, geophysical, and rock physics research. Reference 10 designed an ensemble learning method on the basis of Random Forest (RF), Extreme Gradient Boost (XGBoost), Cat Boost, and Adaboost to classify cancer patients. Then, the grey wolf optimization algorithm was applied to optimize the parameters of ensemble learning. The designed method improved the F1 value by 15% compared with existing methods. Reference 11 developed a stacked ensemble learning method for epidemic prediction and health decision-making assistance processes. This method used Decision Tree (DT), RF, selection operator, and XGBoost classifier in the first level. At the second level, a Linear Regression (LR) classifier was applied to generate the final prediction. The proposed strategy effectively predicted epidemic outbreaks, which was applied to predict epidemic trends in other countries. Reference 12 designed an ensemble learning method for feature selection of network data based on the combination of optical gradient enhancement machine and RF, random gradient descent, Gaussian Naive Bayes (GNB), and Support Vector Machine (SVM). The experiment showed that this method achieved the best prediction performance in different datasets. Reference 13 designed an ensemble learning method on the basis of convolutional neural networks and visual converters for tactical decision prediction of game players. Analysis demonstrated that this method could learn players’ macro intentions in real-time strategy games.

The relevant data of basketball can provide effective support for tactical guidance and player quality analysis, thus enabling effective motion prediction and analysis. Reference 14 proposed a hybrid system for basketball player detection and motion recognition to help basketball players improve their shooting skills and accuracy. The enhanced Yolo algorithm was used to detect players in frames, and the LSTM and fuzzy logic was integrated to operate the basketball action classification. The accuracy of this model for eight basketball actions was 99.3%, with a high player detection rate. Reference 15 calculated the success probability of each pass based on the surrounding environment, including the player and the position of the ball, to determine the difficulty of the pass. Physics and machine learning models were combined to calculate the completion probability of hypothetical passes. The accuracy of successful passing calculated based on this method was 93.0%, which was much higher than existing work. Reference 16 used a Bayesian hybrid model to analyze basketball sports data. In the sports data, key stages of offensive movements, like creating space for passing, follow-up, and preparation/shooting were identified, which were related to invasive movements. This method transformed complex patterns into more coherent statistical data, bringing significant strategic advantages to sports competitions. Reference 17 proposed an basketball goal analysis system on the basis of the 6G Internet of Things. Data mining techniques were applied to excavate the factors that affect goal scoring rates and preprocess basketball images. Then, the background difference and three frame difference methods were integrated to detect targets. The results showed that the frame rate was 58fps, which met the real-time processing performance of 25fps. The error detection rate was very low, which could be used for basketball games or exams.

In summary, significant achievements have been made in integrated learning, which has been widely studied in medical, entertainment, and other information data analysis. Extensive research has also been conducted on basketball and player related aspects. However, from the above research, it can be seen that research on basketball and players mainly focuses on tracking and identifying movement trajectories, as well as predicting and analyzing various postures during sports. There is relatively little data analysis on the abilities of athletes themselves. In addition, this study also focuses on player composition evaluation, which comprehensively analyzes player composition data to more fully evaluate the overall performance of players. Therefore, combined with the advantages of ensemble learning methods in data prediction, it is adopted in basketball player related data analysis. A visualization model based on SRR-Voting is constructed. It is expected to more easily analyze the potential of players, and understand the game process and results.

Methods and materials

To better analyze the differences in characteristics between competitive basketball players and regular basketball players, this study uses Synthetic Minority Over-sampling Technique (SMOTE) to handle the imbalance of samples from different categories in the dataset. Then, the under-sampling method is applied to dislodge noise and other interfering factors from the data, and then combined with RF to select data features. Subsequently, an integrated learning method based on Voting is constructed to visualize data, in order to better discover the potential of athletes from historical data.

Player feature data classification based on SRR

With the support of various deep learning and machine learning technologies, basketball player data visualization has become an important support for player ability prediction and win rate analysis. To better achieve basketball player data analysis, a basketball player data mining algorithm is constructed based on ensemble learning technology. In basketball player data, there is a quantitative difference between regular players and competitive players. When the amount of data in one category is greater than that in another category, there will be data imbalance.¹⁸ Therefore, first of all, the SMOTE is applied to process the imbalance problem in the sample size. Its idea is to use the similarity of existing data samples in the feature space to generate more new data samples.¹⁹ In a few categories of samples, one sample $x$ is selected, and then $S$ samples $x_{s}$ are randomly obtained from K-nearest neighbors of sample $x$ . Then, new samples are synthesized between $x_{s}$ and $x$ in sequence. The relationship is shown in equation (1).

x_{n ew} = x + β \times (x_{A} - x)

(1)

In equation (1),

x_{A}

represents a sample in dataset

A

x

represents a sample.

x_{s}

represents

S

samples.

β

represents is a randomly generated weight coefficient with a value range of [0,1]. SMOTE can effectively expand the sample size of minority categories, balance the sample size, and reduce the sample gap between different categories. The SMOTE is displayed in Figure 1.

Figure 1.

Schematic diagram of the basic principle of SMOTE.

Figure 1 is a dataset with imbalanced samples, where squares represent minority class samples and ellipses represent majority class samples. The AB line in Figure 1 is used to generate new sample data and balance the sample data. In generating new sample data, there are noisy samples and a large amount of duplicate information, which directly affect the subsequent sample classification performance. Therefore, the study adopts the random under-sampling method to remove redundant information from the sample data and retain valid information. It is to randomly remove a certain proportion of samples from the majority without processing minority class samples, achieving sample balance.²⁰ Firstly, parameter $φ$ represents the difference between the majority and minority that need to be deleted. Next, based on the set initial parameter $φ$ and the difference between the majority class and minority class in the training sample, the sample size to be deleted is determined and the $φ$ value is continuously adjusted to ensure that the classifier has a stronger generalization effect on imbalanced data. The ϕ in the equation is selected by verifying the performance of the model. Specifically, different ϕ values are validated on the dataset, such as 0.5, 1, 2, and 3. Then, the maximum value of F1 score is selected. In mild imbalance, the ϕ is generally (1.0–1.5), in moderate imbalance, the ϕ is generally (1.5–3.0), and in extreme imbalance, the ϕ is generally (3.0–8.0).^21,22

After balancing the sample data, the study uses RF to achieve regression and classification of basketball player sample data. Compared with methods such as DT, RE can effectively run on large-scale datasets and relatively complex sample data. RF integrates multiple DTs for data classification. After obtaining the classification results of multiple DTs, voting is carried out based on the principle of minority obeying majority. The category with the most voting results is the final output category of the model.^23,24 The fundamental implementation process is displayed in Figure 2.

Figure 2.

RF ensemble classification structure for player data.

Due to the large amount of data for basketball players, the feature values they cover are usually relatively rich. Therefore, before model training, it is necessary to first select the data features. When performing regression classification on data, RF first ranks the importance of each feature value and then uses the Gini index for feature selection. Specifically, the importance of each feature is analyzed. In a dataset ${A | A_{1}, A_{2}, . . ., A_{n}}$ containing $n$ features, the Gini index is shown in equation (2).

G i n i = \sum_{k = 1}^{| k |} \sum_{k^{'} \neq k} p_{r k} p_{r k^{'}}

(2)

In equation (2),

p_{r k}

signifies the probability of class

k

in node

r

K

represents the number of categories in the dataset.

p_{r k^{'}}

represents the probability of class

k^{'}

r

. The difference between the Gini index of node

r

before branching and the Gini index of the left and right child nodes after branching can be used to represent the importance of the eigenvalue

A_{n}

in that node, as shown in equation (3).

I_{r}^{G i n i} = G i n i_{r} - G i n {i_{l}}_{e f t} - G i n i_{r i g h t}

(3)

In equation (3),

G i n {i_{l}}_{e f t}

represents the Gini value of the left child node after branching.

G i n {i_{l}}_{e f t}

represents the Gini index of the right child node after branching. In decision tree

t

, if the node set of feature

A_{n}

N

, then the importance of

A_{n}

in the

t

-th tree is shown in equation (4).

I_{t n}^{G i n i} = \sum_{n \in N} I_{r}^{G i n i}

(4)

The importance of feature $A_{n}$ in the RF model is shown in equation (4).

I_{n}^{G i n i} = \frac{1}{m} \sum_{i = 1}^{m} V I M_{t n}^{G i n i}

(5)

In equation (5),

m

signifies the total classification trees in the RF. Based on the above analysis, for basketball player data samples, the study first uses SMOTE to generate a certain scale of new samples. To address the imbalance of majority and minority class samples in the samples, the study adopts a random under-sampling method to remove some samples, reduce the impact of noise samples and duplicate samples, and achieve a balanced number of samples in different categories.²⁵ Next, the study uses the RF method to process the sample data and select the most important feature values to construct a feature model. Based on the above content, a data processing method based on SMOTE, random under-sampling, and RF (SRR) is constructed.

Construction of player data visualization model based on SRR-voting algorithm

Data visualization is the process of transforming abstract information data into images, graphics, and interactive interfaces, in order to analyze data more effectively. When constructing visual data models, the research adopts a layered framework approach for design, which can make the entire visual model more systematic and logically clear through layering. The hierarchical framework used in the study mainly includes the basic platform layer, data layer, core algorithm layer, visual display layer, and user interaction layer. Figure 3 displays the overall structure.

Figure 3.

Data visualization model framework.

In Figure 3, the basic platform layer is used to control the overall operation of the entire framework, including data processing, algorithm operations, and data presentation. The data layer is used to input player data and preprocess the data. The core algorithm layer analyzes and processes player data using machine learning algorithms such as SMOTE and RF used in this study, and evaluates the processed data. The visual display layer visualizes the processed team data and player data separately. The user interaction layer completes the connection between data visualization and users, realizing data representation and data export. On the basis of the above framework, the data visualization algorithm of its core algorithm layer is analyzed. To achieve visualization of basketball player data, the research is conducted on predicting basketball player data based on ensemble learning ideas. The ensemble learning is to combine multiple weak classifiers to construct a strong one, thereby avoiding the shortcomings of weak classifiers and better solving regression classification problems. XGBoost is a frequently used strategy for solving classification and regression problems, widely used in data mining and machine learning. Compared with methods such as SVM, this method can effectively handle non normally distributed datasets and has better adaptability. Taking into account the accuracy and adaptability of various classifiers in the data, this study selects K-nearest neighbor (KNN) algorithm, Bagging, and XGBoost to construct ensemble learning classifiers, and then complete Voting learning. An ensemble classifiers based on voting is constructed.^26,27 KNN refers to a sample in the feature space where most of the k nearest samples is a certain category, and the sample also belongs to that category, with same characteristics. It only determines the sample class to be classified on the basis of the nearest one or a few samples in determining the classification decision. Bagging uses random sampling with dropout to obtain training data for different groups of the same size. When Bagging is running, it first trains a base classifier, then adjusts the weight of misclassified samples to gain more attention in later training. Then, it uses the adjusted samples to continue training the next base classifier, and finally achieves better classification results through cumulative adjustment.²⁸ XGBoost reduces the model complexity and enhances its generalization ability by adding regularization terms to the objective function. The reasons for choosing the above three strategies are as follows. Firstly, KNN does not require pre-set data distribution and has significantly stronger fitting ability for nonlinear decision boundaries (such as spiral distribution and multimodal distribution) than LightGBM/CatBoost based on tree splitting. Bagging can effectively reduce its variance reduction through self-sampling (Bootstrap) and parallel training of feature subspaces. XGBoost has stronger applicability for high-precision prediction scenarios. When the number of features is relatively small, it has higher accuracy than LightGBM. Therefore, when the data simultaneously has high noise, complex boundaries, and strong interpretation requirements, this integration strategy is superior to other integration strategies.^29,30 If $k$ regression trees, $L o s s$ represents the loss function, ${\hat{y}}_{m}$ signifies the predicted value, $y_{m}$ signifies the actual value, $Ω$ signifies the regularization term of complexity, and $f$ signifies the functional relationships, then the objective function of the XGBoost is displayed in equation (6).

L (ϕ) = \sum_{m} L o s s ({\hat{y}}_{m}, y_{m}) + \sum_{k} Ω (f_{k})

(6)

The regularization term for complexity is shown in equation (7).

Ω (f) = γ T + \frac{1}{2} λ {‖ w ‖}^{2}

(7)

In equation (7),

T

signifies the leaf node quantity in the current regression tree.

w

represents the weight value of the leaf node. The ensemble learning process based on Voting idea is shown in Figure 4.

Figure 4.

Voting-based ensemble learning process.

In Figure 4, after training multiple weak classifiers separately, the output results of each weak classifier are predicted and voted. The prediction result with the highest number of votes is the classification result obtained for that sample. Among them, the data features with relatively high importance ranking are used as input data for the ensemble learning model. In this algorithm, first, the data is standardized and transformed, and the model is trained and tested separately before outputting the target result. Based on the above processing, the functional modules of the data visualization system are shown in Figure 5.

Figure 5.

Visualization system function module.

Results

To analyze basketball player data, an ensemble learning method based on SRR-Voting is constructed for visualization processing. To demonstrate the effectiveness, the model performance is first validated on an imbalanced dataset. Then, the actual application effect of this method is verified based on specific datasets of basketball players.

Performance analysis of model in imbalanced datasets

To present the effectiveness of the ensemble learning method, two imbalanced datasets are first applied to compare, namely, the Credit Card dataset, which has an imbalanced rate of 4.52 and includes various data such as credit card customer arrears, credit data, and payment history. The other dataset is the Credit Fraud dataset, which includes 2 days of transaction data of a certain bank’s credit card in Europe, belonging to a highly imbalanced dataset, with an imbalanced rate of 577.88. The two datasets are separated into training and testing sets at a 7:3. The experimental environment used for the study is as follows, with the operating system of Windows 7. The CPU is 3.6 GHz, the memory is 32 GB, and the programming language is Python 3.6. The hyperparameter tuning for KNN, Bagging, and XGBoost is as follows. In KNN, the value of k determines the number of nearest neighbor samples. The study selects the optimal k value through 5-fold cross validation. The optimal k value obtained is 5. The core parameter of Bagging is the number of base learners (n_estimators), and the study evaluates its generalization performance through Out-of-Bag. Based on the Out-of-Bag error value, the optimal n_estimators obtained are 100. The XGBoost uses gbtree for model calculation, with a weight value of 1 for child nodes and a penalty coefficient of 1 for L2 regularization. This study verifies the optimal learning rate and tree depth through parameter interaction learning, and obtained a learning rate of 0.3 and a maximum depth of 6 for the decision tree.

The research method is compared with other commonly used ensemble learning combination methods, namely, logistic regression, KNN, DT, AdaBoost, Gradient Boosting Decision Tree (GBDT), Bagging (Ba), and XGBoost. Firstly, the designed method is compared with a single method. The G-mean and AUC values of the two datasets in the training set are shown in Figure 6. G-mean signifies the geometric mean of the prediction accuracy for both majority and minority class samples. When the prediction accuracy of both types of samples is good, the G-mean is the highest, which can be used to balance the classification accuracy of majority and minority classes. In Figure 6, on the Credit Card dataset, the G-means of KNN, Bagging, and XGBoost were 0.63, 0.58, and 0.62, and the AUC was 0.59, 0.67, and 0.46. On the Credit Fraud dataset, the G-means of KNN, Bagging, and XGBoost were 0.53, 0.57, and 0.058, respectively, with AUC values of 0.60, 0.61, and 0.72. In different datasets, the performance of single KNN, Bagging, and XGBoost is superior to other classification methods. Therefore, the ensemble learning method based on the above three classifiers have potential advantages in better performance.

Figure 6.

Performance testing of individual weak classifiers.

Based on the performance testing of the single weak classifier mentioned above, the ensemble learning classification method constructed is compared with the research method. The ensemble methods include LR+DT+AdaBoost (Method 1), LR+AdaBoost+GBDT (Method 2), AdaBoost+GBDT+XGBoost (Method 3), KNN+DT+AdaBoost (Method 4), and the designed KNN+Bagge+XGBoost (Method 5). The G-mean and AUC values obtained in the training set are shown in Table 1. From Table 1 and Figure 6, all ensemble learning methods had better G-mean and AUC values than a single weak classifier. Specifically, in Table 1, the G-mean and AUC values of the research method (Method 5) on the Credit Card dataset were 0.964 and 0.935, respectively. On the Credit Fraud dataset, the G-mean and AUC of Method 5 were 0.931 and 0.956, respectively. The designed method performs the best on G-means and AUC values on different datasets, as this ensemble learning method integrates the advantages of multiple weak classifiers, achieving higher performance.

Table 1.

Performance comparison of different ensemble learning methods.

Datasets	Ensemble learning methods	G-mean	AUC
Credit card	Method 1	0.874	0.895
	Method 2	0.859	0.826
	Method 3	0.791	0.746
	Method 4	0.921	0.889
	Method 5	0.964	0.935
Credit fraud	Method 1	0.795	0.816
	Method 2	0.864	0.825
	Method 3	0.897	0.916
	Method 4	0.925	0.883
	Method 5	0.931	0.956

Analysis of player data prediction effect

To verify the classification effect of the proposed method on specific team player datasets, the dataset used in the study is from the authoritative data website Basketball-Reference in the United States, which provides routine statistical data of players in NBA games. A crawler system is used to obtain player data for various games from 1990 to 2020, including player age, hit rate, and number of mistakes, totaling 22,611 pieces of data, with “Class” as the label field, the Class value of competitive players being 1, and the Class value of ordinary players being 0. The predictive ability of the model is comprehensively evaluated using precision, recall, and F1 value. From Table 2, the precision of the research method was 0.9486, while method 1, method 2, method 3, and method 4 were 0.9125, 0.9364, 0.9208, and 0.9516. The recall and F1 value of the research method were 0.9588 and 0.9752. The recall of method 1, method 2, method 3, and method 4 was 0.9344, 0.9521, 0.9150, and 0.9347. The F1 value of method 1, method 2, method 3, and method 4 was 0.9202, 0.9261, 0.8907, and 0.9678. From this perspective, the precision of the research method is not the best among all methods, but its recall and F1 value are both the best, which is superior to the comparative methods. Therefore, overall, the research method also has better player prediction ability in specific basketball player datasets.

Table 2.

Comparison of predictive performance of different ensemble learning methods.

Methods	Evaluating indicator
Methods	Precision	Recall	F1
Method 1	0.9125	0.9344	0.9202
Method 2	0.9364	0.9521	0.9261
Method 3	0.9208	0.9150	0.8907
Method 4	0.9516	0.9347	0.9678
Method 5	0.9486	0.9588	0.9752

The ROC curve of each ensemble learning method in the basketball player dataset is displayed in Figure 7. According to Figure 7(a), the ROC values of the four comparison methods were 0.83, 0.85, 0.88, and 0.89, respectively. The ROC value of the designed method was 0.94. In Figure 7(b), the fitting values of the four comparison methods were 0.28, 0.22, 0.37, and 0.23, and the fitting value of the designed method was 0.17. This result indicates that the designed ensemble learning method has the optimal ROC value and fitting effect.

Figure 7.

ROC curves and fitting effect changes of various algorithms. (a) ROC and (b) fitness value.

Taking the player’s attribute “Age” as an example, the “Age” of 5 team members is visualized. Taking the four time points of 2005, 2010, 2015, and 2020 as examples, the “Age” distribution of each team member is displayed in Figure 8. In Figure 8, the vertical axis signifies the player’s attributes, where the player’s “Age” is represented, and the horizontal axis represents the team number. From Figure 8(a), the age distribution of players in Team 2 and Team 4 was relatively more dispersed. From Figure 8(b), each team member had a significant age difference. In 2015, one player in Team 3 was relatively older, while the age distribution of other players was more concentrated. In 2020, the age distribution in Team 5 was relatively more dispersed. By visualizing the age data of team members, it is possible to clearly analyze the changes in the age attribute of the entire team members. If the “Age” attribute is replaced with other attributes, it can also show a clear change trend.

Figure 8.

To further validate the predictive performance of this method on player data, six players are randomly selected from 2005 to 2020 for analysis of their ability changes. After analyzing and comparing the comprehensive data of players, the overall ranking of players can be obtained, which in turn reflects their overall level. The resulting player ability changes are shown in Figure 9. From the visualization results displayed in Figure 9, the closer the curve, the higher the ranking. Years without data indicate that the player has not joined or left the team. The ranking differences of the same player in different years were significant. Taking Player 1 as an example, the rankings from 2005 to 2014 were 3, 3, 1, 11, 10, 7, 9, 12, 13, and 14, respectively. By visualizing the overall abilities of team players, it is possible to clearly and intuitively analyze the changes in their overall ranking over a period of time.

Figure 9.

Changes in player abilities.

To further analyze the performance of the designed method, it is applied to different athlete datasets to verify its performance, including the conventional statistical dataset in NBA games used in the previous section, the International Basketball Federation World Championships (FIBA) data from 2000 to 2020, and the Chinese University Basketball Association (CUBA) male athlete data. The accuracy of the research method in various datasets is displayed in Table 3. In Table 3, the accuracy of the proposed method (Method 5) in the three datasets was 95.36%, 94.07%, and 91.49% on NBA, FIBA, and CUBA datasets, respectively. Because there are differences in data characteristics among different datasets, there are differences in accuracy among different datasets. However, within the same dataset, the performance of the research method is relatively better.

Table 3.

Accuracy comparison of various methods in different datasets.

Methods	NBA	FIBA	CUBA
Method 1	80.45%	81.35%	79.04%
Method 2	93.65%	84.67%	86.62%
Method 3	82.34%	91.53%	81.87%
Method 4	91.90%	92.81%	88.25%
Method 5	95.36%	94.07%	91.49%

To further validate the performance of the research method, the updated dataset NBA Player Stats (2023-–2024) and player health and fitness dataset are selected for the study. The former includes regular statistical data of players. The latter includes physical fitness test data of players, namely, GPS physical fitness dataset, which evaluates the physical fitness needs of football players during training and competition through GPS devices, and records exercise loads (such as running distance, speed, acceleration and deceleration, and energy consumption). Comparatively speaking, method 5 still maintained optimal performance in the updated dataset. On the NBA Player Stats, the CPU usage time and Memory footprint of method 5 were 87 seconds and 2.1 GB. On the GPS Fitness Dataset, the CPU usage time and Memory footprint of method 5 were 56 seconds and 2.3 GB. Overall, the research method has the lowest computational cost and the highest computational efficiency (Table 4).

Table 4.

Computational cost comparison of various methods in different datasets.

Methods		NBA player stats	GPS fitness dataset
Method 1	CPU usage time (second)	180	120
Method 1	Memory footprint (GB)	3.7	2.9
Method 2	CPU usage time (second)	138	114
Method 2	Memory footprint (GB)	5.2	2.5
Method 3	CPU usage time (second)	172	105
Method 3	Memory footprint (GB)	4.1	3.1
Method 4	CPU usage time (second)	94	73
Method 4	Memory footprint (GB)	2.9	2.4
Method 5	CPU usage time (second)	87	56
Method 5	Memory footprint (GB)	2.6	2.3

Discussion and conclusion

Developing the competitive ability of basketball players is the fundamental goal of improving their performance. Therefore, analyzing the relevant data of team members is of great significance for analyzing the individual qualities and abilities. To better analyze ordinary players and competitive players, this study combined SMOTE and under-sampling techniques to handle the imbalance in the player dataset. Next, an ensemble learning method based on Voting was constructed to evaluate the performance of team players. In the analysis of imbalanced datasets, the G-mean and AUC values of the research method on the Credit Card dataset were 0.964 and 0.935, respectively. On the Credit Fraud dataset, the G-mean and AUC values were 0.931 and 0.956, respectively, which were higher than the comparison methods. In the player dataset analysis, the Precision, Recall, and F1 of the research method were 0.9486, 0.9588, and 0.9752, which were superior to comparison methods. The proposed method has a ROC value of 0.94 and a fitting value of 0.17, indicating better fitting performance. The accuracy of the proposed method in three datasets was 95.36%, 94.07%, and 91.49% on NBA, FIBA, and CUBA datasets, respectively, indicating higher accuracy in different datasets. The visualization results, taking the age of team members and their overall abilities as examples, clearly presented the specific personal basic data and comprehensive abilities of each team and player. This indicates that the player data visualization model based on Voting designed in the study can effectively analyze player related data and the overall quality of the team.

However, there are still shortcomings in the research. In player ability prediction, coaches have a guiding role in the development of the team. The study does not consider the influence of coaches on the team. Furthermore, this research method has only been applied to basketball players and its effectiveness in other scenarios has not been validated. In the future, this research method will further consider the impact of coaches on the development of players and teams when visualizing player abilities. At the same time, future attempts will be made to apply this research method to the visualization analysis of athletes in other sports such as football and volleyball.

Footnotes

ORCID iD

Zhixin Cong

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data will be made available on reasonable request.*

References

Wang

. Study on a method for capturing basketball player's layup motion based on grey level co-occurrence matrix. Int J Inf Commun Technol 2024; 24(1): 60–71.

Lian

Wang

, et al. ANN-enhanced IoT wristband for recognition of player identity and shot types based on basketball shooting motion analysis. IEEE Sens J 2022; 22(2): 1404–1413.

Rogers

Crozier

Schranz

, et al. Player profiling and monitoring in basketball: a delphi study of the most important non-game performance indicators from the perspective of elite athlete coaches. Sports Med 2022; 52(5): 1175–1187.

Wang

. Research on basketball dunk motion recognition method based on characteristic point trajectory. Int J Comput Intell Stud 2023; 12(1/2): 2–14.

Chuanxia

Yinhui

. Imbalanced fault diagnosis based on semi-supervised ensemble learning. J Intell Manuf 2023; 34(7): 3143–3158.

Wang

Tang

Shao

MAS

, et al. HoopTransformer: advancing NBA offensive play recognition with self-supervised learning from player trajectories. Sports Med 2024; 54(10): 2663–2673.

Haider

Waqas

Hanif

, et al. Network load prediction and anomaly detection using ensemble learning in 5G cellular networks. Comput Commun 2023; 197(1): 141–150.

Oshevwe Okoro

Ayo-Farai

Maduka

, et al. Emerging technologies in public health campaigns: artificial intelligence and big data. Acta Inform Malays 2024; 8(1): 05–10. DOI: 10.26480/aim.01.2024.05.10.

Zhang

Wang

, et al. Lithofacies prediction driven by logging-based Bayesian-optimized ensemble learning: a case study of lacustrine carbonate reservoirs. Geophys Prospect 2023; 71(9): 1835–1872.

10.

Saheed

Balogun

Odunayo

, et al. Microarray gene expression data classification via wilcoxon sign rank sum and novel grey wolf optimized ensemble learning models. IEEE/ACM Trans Comput Biol Bioinform 2023; 20(6): 3575–3587.

11.

Ismail

Alsalamah

Mohamed

. GA-Stacking: a new stacking-based ensemble learning method to forecast the COVID-19 outbreak. Comput Mater Continua (CMC) 2023; 74(2): 3945–3976.

12.

Kumar

Shweta

. Mitigating cyber threats through integration of feature selection and stacking ensemble learning: the LGBM and random forest intrusion detection perspective. Clust Comput 2023; 26(4): 2339–2350.

13.

Han

Kim

. Deep ensemble learning of tactics to control the main force in a real-time strategy game. Multimed Tools Appl 2024; 83(4): 12059–12087.

14.

Luo

Ul Islam

. Tracking and detection of basketball movements using multi-feature data fusion and hybrid YOLO-T2LSTM network. Soft Comput 2024; 28(2): 1277–1294.

15.

Anzer

Bauer

. Determining the difficulty of a pass in football (soccer) using spatio-temporal data. Data Min Knowl Discov 2022; 36(1): 295–317.

16.

Santos-Fernandez

Denti

Mengersen

, et al. The role of intrinsic dimension in high-resolution player tracking data-insights in basketball. Ann Appl Stat 2022; 16(1): 326–348.

17.

Qin

. Goal recognition method using intelligent analysis of basketball images under 6G Internet of Things technology. Int J Grid Util Comput 2022; 13(2/3): 138–145.

18.

Hemalatha

Amalanathan

FG-SMOTE . Fuzzy-based Gaussian synthetic minority oversampling with deep belief networks classifier for skewed class distribution. Inter J Intellig Comp Cybe 2021; 14(2): 269–286.

19.

Pozo

Gonzalez

ABR

Wilby

, et al. Prediction of on-street parking level of service based on random undersampling decision trees. IEEE trans Intell Transp Syst 2021; 23(7): 18327–18336.

20.

Searle

Kaszta

Bauer

, et al. Random forest modelling of multi-scale, multi-species habitat associations within KAZA transfrontier conservation area using spoor data. J Appl Ecol 2022; 59(9): 2346–2359.

21.

Rajput

Roy

Soni

, et al. ELM-based imbalanced data classification-A review. Inform: An Inter J Comp Inform 2024; 48(2): 185–204.

22.

Kurien

Chikkamannur

. A stacking ensemble for credit card fraud detection using SMOTE technique. Int J Eng Syst Model Simulat 2024; 15(6): 284–290.

23.

Lefebvre

Johnson

Marshall

, et al. 3D model visual verification and mesh-based data analysis in fulcrum. Trans Am Nucl Soc 2021; 124(6): 643–646.

24.

Mehta

Kaur

. Optimizing software fault prediction using voting ensembles in class imbalance scenarios. Int J Perform Eng 2024; 20(11): 676–687.

25.

Bhati

Shankar

Saxena

, et al. An ensemble-based approach for image classification using voting classifier. Int J Model Identif Control 2022; 41(1/2): 87–97.

26.

Shao

Zhang

. Design and implementation of sunshine sports system for colleges and universities based on c4.5 algorithm. Adv Ind Eng Manag 2024; 13(4): 150–159.

27.

Ayu

Sari

Farida

, et al. Classification of employee competency assessment using naive Bayes and K-nearest neighbor (KNN) algorithms. J Adv Inf Technol 2024; 15(7): 879–885.

28.

Gheisari

Hamidpour

Liu

, et al. Data mining techniques for web mining: a survey. Artif Intell Appl 2023; 1(1): 3–10.

29.

Bagci Das

. Real-time adaptable fault analysis of rotating machines based on Marine predator algorithm optimised LightGBM approach. Nondestr Test Eval 2025; 40(3): 831–866.

30.

Song

Zhu

, et al. Credit risk prediction based on improved ADASYN sampling and optimized LightGBM. J Social Comput 2024; 5(3): 232–241.