Sports competition data analysis and strategy optimization using K-means clustering algorithm

Abstract

Sports competition data analysis and strategy optimization are important ways to enhance athlete competitiveness and team collaboration. Current competition analysis and strategy formulation are highly subjective, which makes it difficult to understand the performance characteristics and patterns of athletes and teams in depth. Traditional analysis methods cannot accurately identify performance differences between athletes, and their feature recognition and classification are limited. To make strategy formulation more scientific and improve athlete performance in competitions, this article applied the K-means clustering algorithm to basketball and conducted an in-depth study of sports competition data analysis and strategy optimization. Firstly, the competition data was collected and preprocessed. Then, feature selection was carried out from three dimensions: competition results, player performance, and team characteristics. Finally, the K-means clustering algorithm was combined with a hierarchical clustering method to cluster the original data. To verify its effectiveness, this article conducted practical analysis on the data of the most recent 5 basketball competitions in 10 university basketball leagues in a certain province and optimized strategies based on cluster analysis. The results showed that in terms of player performance, compared to before optimization, the average number of rebounds, assists, and steals of team players optimized based on algorithm strategy increased by about 38.9%, 25.0%, and 63.2%, respectively. The conclusion indicates that the application of the K-means clustering algorithm in sports competition data analysis and strategy optimization can help improve the competitive level of athletes and enhance their performance.

Keywords

sports competition, data analysis, strategy optimization, K-means clustering algorithm, basketball performance

Introduction

The progress of social economy has driven the development and prosperity of the sports industry.^1,2 With the continuous expansion of the scale and quantity of competitions, sports data analysis has become an important method to improve the competitive ability of athletes. Sports competition data contains rich information, and traditional statistical analysis methods have limitations when dealing with large-scale and heterogeneous data. How to effectively mine and analyze these data in order to design more reasonable competition strategies is an urgent problem that needs to be solved in the current competitive sports industry. With the advent of the big data era, K-means clustering algorithms are increasingly being used in practical scenarios. It is easy to understand and implement, with high computational efficiency, and has broad application prospects in professional fields such as marketing, signal processing, image segmentation, and financial risk analysis. As a common clustering analysis method, it can effectively classify multi-dimensional data, thereby better discovering the internal connections and patterns between data.³ Applying it to competition data analysis can better understand the important influencing factors and patterns during the competition process, help coaches and athletes formulate strategies more accurately, and improve the overall strength and effectiveness of the team. This has important practical value and significance for promoting the development of the competitive sports industry and improving the level of competition.

To make the formulation of training strategies more scientific and raise the performance level of athletes, this article applies the K-means algorithm to the analysis of sports competition data and strategy optimization. Taking basketball as an example, clustering analysis is conducted on competition data, and experimental analysis is conducted on this basis. To verify its effectiveness, this article takes participating teams from 10 university basketball leagues in a certain province as samples and takes data from the most recent 5 competitions of each team as objects. Practical analysis is conducted from three levels: clustering results, player performance, and team performance. The clustering results show that the overall allocation of data at the three levels of competition results, player performance, and team characteristics is good, and the silhouette coefficient of each clustering level is generally close to 1. In terms of player performance, compared to before optimization, the average number of rebounds, assists, and steals of team players optimized based on algorithm strategy increases by about 38.9%, 25.0%, and 63.2%, respectively. In terms of team performance, compared to before optimization, the average score of the team after optimization based on algorithm clustering analysis increases by 12.7%, and the difference in average scores between home and away competitions decreases by 50.0%. The innovation of this article lies in identifying different types of athletes or performance patterns through clustering analysis of athlete performance data, and then developing personalized training plans for each type. In practical applications, applying the K-means clustering algorithm to sports competition data analysis and strategy optimization can help improve the overall performance of athletes and teams and support the enhancement of their competitive level.

Related works

Sports data analysis is not only a tool but also a necessary means of optimizing athlete training.^4,5 With the rapid development and application of digital sports, sports competition data analysis and strategy optimization have also achieved certain results. To balance team building decisions, Muniz Megan provided a sports analysis model based on mixed integer nonlinear programming, which rephrased the objective function and overcame the computational challenges of nonlinearity. By using team competition data from 2019 to 2020 as a case study, it was demonstrated that the proposed model maximized the total value of the team and balanced the synergy between players.⁶ Cao Yuan established a multi-objective optimization method and a simulation model for sports activity exercise data in a multi-decision network design model based on existing decision network planning methods and situational analysis methods, and analyzed sports competition data and strategy optimization. Finally, the experimental results showed that the proposed model can simultaneously select multiple decision spaces.⁷ In order to gain a competitive advantage and manage injury risk, Torres-Ronda Lorena discussed the application of a quantitative training and competition feature tracking system in sports competition data analysis and strategy optimization, and demonstrated through examples that the system can monitor and evaluate the training and competitions of each sport, and support the prescription of training loads and objective decision-making of operations.⁸ Although existing analytical methods have certain guiding capabilities for improving athlete performance and competition strategies, they still struggle to effectively handle a large amount of multi-dimensional competition data and cannot uncover the potential patterns and correlations behind the data. 
They often focus on a few indicators or data dimensions, while ignoring the more complex and diverse factors in sports competitions, resulting in one-sided and incomplete analysis results.

The development of big data technology and algorithms has provided more possibilities for intelligent analysis of the potential patterns and correlations of competition data.⁹ Fujii Keisuke proposed a data-driven analysis method to quantitatively understand behaviors in invasion team sports such as basketball and football. By extracting easily interpretable features or rules from competition data, behaviors were generated and controlled in a visually understandable manner, providing support for sports strategy optimization.¹⁰ Wang Zejun proposed a football tactical analysis method based on position data in the context of big data, which automatically identified the characteristics of team tactics through machine learning algorithms. By using data visualization and reporting methods, objective information was provided for optimizing athlete performance results, and a new approach was provided for football tactical research.¹¹ Liu Aijun proposed an effective video-based visualization framework based on artificial intelligence and big data analysis, which extracted temporal and spatial features of sports videos to classify them. The experimental results showed that compared with other existing models, the proposed model achieved 98.7% precision in analyzing competition data.¹² Based on 100 sets of historical scoring data from 14 teams in the basketball league, Zhang Hanzhe qualitatively analyzed the level of each team using the Analytic Hierarchy Process (AHP) and then compared these data through cluster analysis. This provided a more reasonable and scientific method for improving the probability of each team winning the championship, effectively promoting the development of basketball.¹³ Clustering algorithms can consider multiple data dimensions simultaneously and analyze competition situations more intelligently and comprehensively. However, most studies still have certain limitations in personalized tactical optimization and training program development.

Sports competition data analysis and strategy optimization

Sports competition data analysis is the analysis of various data information displayed by athletes based on their performance on the field, such as the performance of technical movements. While conducting longitudinal analysis of individual athlete competition data, it is also necessary to conduct horizontal analysis with other athletes. Basketball is a type of team sports event with rich statistical data, and the individual performance of each player has a significant impact on the results of the event.^14,15 This article takes basketball sports as an example and, through clustering analysis of competition data, explores the performance characteristics of athletes and teams to achieve strategy optimization.

Data collection and processing

Compared to real-time data, historical competition data can better reflect the sports development trends and changes of athletes over a long period of time. Therefore, in data collection, this article only analyzes historical competition data.

This article uses Python to obtain competition data from relevant websites. In the data collection process, the start site is first opened by the crawler, which yields the initial request to the engine. Then, the crawler is used to obtain the URL (Uniform Resource Locator). When the engine obtains the first URL, it does not process the URL itself but sends it to the scheduler, as shown in Figure 1.

Figure 1.

Crawling process of competition data.

In Figure 1, the scheduler turns each URL into a request instruction and queues it for the engine. When the engine asks for the next request, the scheduler returns each URL in the list to the engine, which forwards the request to the downloader through the downloader middleware. After the web page has been downloaded, the response is passed back to the engine through the downloader middleware. The engine then sends the response to the crawler through the crawler middleware. The crawler processes these responses, returns the crawled items, and sends any new request instructions to the engine. The scheduler provides pending requests to the engine and ensures that the same URL is not crawled repeatedly, while the engine is responsible for routing these requests and their results. This process is repeated until there are no URL request instructions left in the scheduler, and the connection between the engine and the domain is then closed.

Because data structures differ among sites and database access can peak concurrently, the crawler must remain stable during processing to ensure data integrity. This article uses Requests to send HTTP (Hypertext Transfer Protocol) requests to the target sites and then uses BeautifulSoup to parse the HTML (HyperText Markup Language) structure, extracting relevant data such as competition dates, opposing sides, scores, and player performance, and storing them in a Pandas DataFrame. In order to obtain more complete data, a third-party open source API (Application Programming Interface) is used to obtain JSON (JavaScript Object Notation) format data, which includes detailed statistics of each competition, player personal data, team data, etc.; these data are then parsed and stored through Pandas. The API has good response speed and stability, and in practice it is called over the HTTP protocol. For JSON data conversion, the pandas.read_json() function is used to read JSON strings or files directly, and Pandas’ data processing capabilities are used to clean and convert the data.
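The JSON-to-Pandas step can be sketched as follows. This is a minimal illustration only: the payload and its field names (`team`, `opponent`, `points`, `rebounds`) are hypothetical and are not taken from the API used in this article.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON payload; the field names are illustrative only.
payload = """
[
  {"team": "A", "opponent": "B", "points": 78, "rebounds": 31},
  {"team": "B", "opponent": "A", "points": 74, "rebounds": null},
  {"team": "A", "opponent": "C", "points": 85, "rebounds": 36}
]
"""

# read_json accepts a path, URL, or file-like object; StringIO wraps the string.
games = pd.read_json(StringIO(payload))

# Basic cleaning: drop exact duplicate rows and count missing values per column.
games = games.drop_duplicates().reset_index(drop=True)
missing_per_column = games.isna().sum()
```

Each JSON object becomes one row, and a `null` entry is loaded as a missing value that the later cleaning step can fill or remove.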

After collecting competition data, data cleaning and preprocessing are carried out to obtain a more concise and accurate dataset. For data with a small proportion of missing values, the missing entries are filled with the mean of the column; if the missing ratio exceeds 40%, the data is removed directly. Incomplete records are filled in with blank values; redundant data is deleted directly; for inconsistent data types, standardized feature processing is applied: discrete data are encoded as the numerical values $\{1, 2, 3, \dots\}$, the values of each attribute are standardized, and the data are mapped uniformly to the [0,1] interval. This step is represented by the following formula:

\bar{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

(1)

Among them, the definition of formula variables is shown in Table 1.

Table 1.

Definition of variables in formula (1).

Sequence	Variables	Meaning
1	$x$	Raw competition data
2	$\bar{x}$	Normalized competition data
3	$x_{\min}$	Minimum value of data attributes
4	$x_{\max}$	Maximum value of data attributes
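As a minimal sketch of the cleaning rules and formula (1), the following pure-Python functions fill missing values with the column mean and then apply min-max normalization. The sample values are hypothetical, and applying the 40% threshold per column is an assumption about how the rule is used.

```python
def missing_ratio(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_normalize(column):
    """Map values to [0, 1] following formula (1)."""
    x_min, x_max = min(column), max(column)
    if x_max == x_min:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - x_min) / (x_max - x_min) for x in column]


rebounds = [4, None, 8, 6, 2]  # hypothetical per-game values
# Assumption: the 40% rule is checked per column before filling.
keep_column = missing_ratio(rebounds) <= 0.4
filled = fill_missing_with_mean(rebounds)  # mean of 4, 8, 6, 2 is 5.0
normalized = min_max_normalize(filled)
```

After normalization the smallest value in the column maps to 0 and the largest to 1, so attributes measured on different scales become comparable.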

Feature selection

To depict the competition situation more accurately, the performance of athletes and the characteristics of teams are analyzed in depth, laying a foundation for subsequent data analysis and decision optimization. Based on the characteristics of basketball competitions, features are engineered from the collected competition data. Feature data is extracted from three aspects, namely competition results, player performance, and team characteristics, and then used as input for the algorithm, as shown in Figure 2.

Figure 2.

Feature selection.

Competition results

In the competition results, the main objective variables are victory or defeat and score difference. The victory or defeat of a competition directly reflects the team’s performance in the competition, and the score difference affects the team’s ranking. When the scores of two teams are the same, the point difference is calculated.^16,17 The ranking of a team is not determined by one factor alone. Therefore, this article conducts an in-depth analysis of the competition data based on the two levels of victory or defeat and score difference. The characteristic values are shown in Table 2.

Table 2.

Feature selection of competition results.

Feature name	Sequence	Classification	Definition
Competition results	1	Win or lose the competition	Win
	1	Win or lose the competition	Lose
	2	Score difference	The average difference between each competition and the previous competition

Player performance

In basketball, player performance is very important, and their scores, rebounds, assists, and other data can intuitively reflect the player’s role and ability on the field.¹⁸ Player performance needs to be explored and analyzed from multiple levels and perspectives. Its detailed features are presented in Table 3.

Table 3.

Feature selection of player performance.

Feature name	Sequence	Classification	Definition
Player performance	1	Scores	The points scored per player in the competition
	2	Rebounds	The number of rebounds per player in the competition
	3	Assists	The number of assists per player in the competition
	4	Steals	The number of steals per player in the competition
	5	Errors	The number of mistakes made per player in the competition
	6	Fouls	The number of fouls per player in the competition

Team characteristics

In addition to the individual performance of the players, the overall characteristics of the team are also a main factor determining the outcome. The historical records and average scores of each team can reflect their overall strength and characteristics. In terms of feature selection, this article mainly analyzes the competition data from three levels: historical win rate, average score, and the score difference between home and away competitions. The details are shown in Table 4.

Table 4.

Feature selection of team characteristics.

Feature name	Sequence	Classification	Definition
Team characteristics	1	Historical win rate	(Team victories/total competitions) * 100%
	2	Average score	Total score of the last 5 competitions/5
	3	Score difference between home and away competitions	Home team’s score in the last 5 competitions minus visiting team’s score in the last 5 competitions
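The definitions in Table 4 can be illustrated with a small calculation. The records below are hypothetical, the win rate is computed over the same five games for brevity, and reading the home and away difference as the gap between mean points per venue is an assumption rather than the article's stated procedure.

```python
# Hypothetical records of one team's last 5 competitions:
# (points scored, won?, played at home?)
last_five = [
    (82, True, True),
    (75, False, False),
    (90, True, True),
    (70, False, False),
    (88, True, False),
]

# Historical win rate: (team victories / total competitions) * 100%.
wins = sum(1 for _, won, _ in last_five if won)
win_rate = wins / len(last_five) * 100

# Average score: total score of the last 5 competitions / 5.
average_score = sum(points for points, _, _ in last_five) / len(last_five)

# Assumption: the home/away difference compares mean points per venue.
home = [points for points, _, at_home in last_five if at_home]
away = [points for points, _, at_home in last_five if not at_home]
home_away_diff = sum(home) / len(home) - sum(away) / len(away)
```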

Cluster analysis

On the basis of feature selection, the cluster analysis is used to deeply mine and analyze the data. As an unsupervised learning method, clustering analysis measures the similarity between data and gathers data that meet similar conditions, while those that do not match are classified into different groups.^19,20 In competition data analysis, clustering algorithms are used to divide competition data into several clusters, effectively identifying the characteristics of participating players and teams. By exploring the patterns and rules contained in competition data, the understanding of competition results, player performance, and team characteristics can be deepened.

The K-means algorithm is a simple and effective algorithm.^21,22 It is suitable for partitioning competition data with large-scale characteristics. The basic idea is to divide the dataset into K clusters, so that the data points within each cluster are similar to each other, while the data points between different clusters have significant differences. K-means iteratively finds the structure of the dataset, with the core being to define the center points of clusters and partition the data based on these centroids. This article combines the K-means algorithm to analyze feature data, as shown in Figure 3.

Figure 3.

K-means algorithm clustering.

From Figure 3, it can be seen that in clustering analysis, k objects are randomly selected from the competition feature data as the initial cluster centers. Then, following the principle of high similarity within a cluster and low similarity between clusters, the remaining objects are assigned to the nearest cluster, forming new clusters. The mean of the data in each new cluster is taken as its new center, and the assignment and update steps are repeated until the cluster centers no longer change; finally, the clustering result is output.
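The iteration described above can be sketched in plain Python. This is an illustrative implementation of standard K-means with random initial centers, not the article's own code; the sample points, the seed, and the iteration cap are assumptions.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Plain K-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            mean_point(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return centers, clusters


# Hypothetical normalized feature vectors forming two separated groups.
points = [
    (0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
    (0.80, 0.90), (0.85, 0.80), (0.90, 0.95),
]
centers, clusters = k_means(points, k=2)
```

Because the initial centers are drawn at random, different seeds can give different results on less clearly separated data, which is one motivation for choosing the initial centers more carefully.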

The similarity between data points within the same cluster is measured using Euclidean distance. Assuming that the competition dataset $D$ contains $n$ data objects, similarity is defined as a function $sim: D \times D \to R$. Given a positive integer $k$, the dataset $D$ is divided into $k$ disjoint subsets $D_{1}, D_{2}, \dots, D_{k}$, so that formula (2) is minimized²³:

DISP = \sum_{i = 1}^{k} disp (D_{i}, p_{i})

(2)

Among them, $p_{i}$ is the representative point (center) of subset $D_{i}$. $disp (D_{i}, p_{i})$ is defined as follows:

disp (D_{i}, p_{i}) = \sum_{x \in D_{i}} sim (x, p_{i})

(3)

On this basis, the established clustering objective function is corrected iteratively, with each iteration moving in the direction that minimizes the objective function, thereby obtaining a locally optimal clustering result. The competition data sample set $X$ to be clustered consists of $N$ points in $p$-dimensional real space and is represented as $X = \{x_{i} \mid x_{i} \in R^{p}, i = 1, 2, \dots, N\}$. The resulting $C$ cluster centers are $l_{1}, l_{2}, \dots, l_{C}$. Assuming $M_{j} (j = 1, 2, \dots, C)$ represents the $C$ categories of clustering and $N_{j}$ is the number of samples in $M_{j}$, then:

l_{j} = \frac{1}{N_{j}} \sum_{x \in M_{j}} x

(4)

The objective function $F (x)$ is defined as follows:

F (x) = \sum_{i = 1}^{C} \sum_{j = 1}^{N_{i}} d_{i j}^{2} (x_{j}, l_{i})

(5)

Among them, $N_{i}$ is the number of samples included in the $M_{i}$ clustering category. $d_{i j} (x_{j}, l_{i})$ is the Euclidean distance^24,25:

d_{i j} (x_{j}, l_{i}) = \sqrt{(x_{j} - l_{i})^{T} (x_{j} - l_{i})}

(6)

$d_{i j} (x_{j}, l_{i})$ is the distance between a sampling point and its cluster center, and the sum of squared distances from sampling points to cluster centers is the objective function $F (x)$; that is, minimizing $F (x)$ minimizes the mean square error of clustering.
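To make the objective concrete, the sketch below evaluates the sum of squared distances for a tiny hypothetical partition, once with the cluster means as centers and once with shifted centers. The data and function name are illustrative assumptions.

```python
def sum_squared_error(clusters, centers):
    """Objective F: sum of squared Euclidean distances to each cluster center."""
    total = 0.0
    for members, center in zip(clusters, centers):
        for x in members:
            total += sum((a - b) ** 2 for a, b in zip(x, center))
    return total


# Hypothetical partition into two clusters of two points each.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (6.0, 0.0)]]

# Centers at the cluster means: each point contributes 1.
at_means = sum_squared_error(clusters, [(0.0, 1.0), (5.0, 0.0)])

# Centers moved onto one member of each cluster: contributions 0, 4, 0, 4.
off_means = sum_squared_error(clusters, [(0.0, 0.0), (4.0, 0.0)])
```

For a fixed assignment the cluster mean gives the smallest value of the objective, which is why the update step in formula (4) uses the mean.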

Basketball competition data is large in scale and has many features, so using only the K-means method limits its discriminative ability. Moreover, excessive data dimensionality can also lead to poor clustering performance. To reduce the time complexity of the algorithm and obtain better initial centers, a hierarchical clustering method is applied to the original data, as shown in Figure 4.

Figure 4.

Hierarchical clustering analysis.

In Figure 4, to determine the value of k, the approximate number of classes is first estimated with the silhouette coefficient. It is assumed that the competition data has already been clustered, that is, divided into several clusters. For any data object $i$, let $C_{i}$ denote the cluster containing it and $| C_{i} |$ the number of objects in that cluster. $u_{i}$ represents the average dissimilarity between object $i$ and the other objects in the same cluster, which reflects the quality of assigning object $i$ to the current cluster. The smaller $u_{i}$ is, the more similar object $i$ is to the other data objects in the cluster, and the more appropriate its assignment to the current cluster. With $d (x_{i}, x_{j})$ denoting the Euclidean distance between objects $i$ and $j$, $u_{i}$ is expressed as follows:

u_{i} = \frac{1}{| C_{i} | - 1} \sum_{j \in C_{i}, j \neq i} d (x_{i}, x_{j})

(7)

It is assumed that $v_{i}$ represents the minimum average distance between object $i$ and the objects of any other cluster, that is, the average distance from object $i$ to the nearest cluster that does not contain it. If object $i$ is not properly assigned to the current cluster, it should be assigned to this nearest cluster.²⁶ With $C_{m}$ ranging over the clusters other than the cluster $C_{i}$ that contains object $i$, and $d (x_{i}, x_{j})$ denoting the Euclidean distance between objects $i$ and $j$, $v_{i}$ is expressed as follows:

v_{i} = \min_{C_{m} \neq C_{i}} \frac{1}{| C_{m} |} \sum_{j \in C_{m}} d (x_{i}, x_{j})

(8)

Then, the silhouette coefficient $δ_{i}$ is defined as follows²⁷:

δ_{i} = \frac{v_{i} - u_{i}}{\max \{u_{i}, v_{i}\}}

(9)
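Formulas (7) to (9) can be sketched in code as follows. The sketch reads them as the standard silhouette definition based on pairwise distances between samples, which is an assumption about the intended notation; the sample clusters are hypothetical, and at least two non-empty clusters are assumed.

```python
import math


def silhouette(clusters, c_idx, p_idx):
    """Silhouette coefficient of sample p_idx in cluster c_idx."""
    point = clusters[c_idx][p_idx]
    own = [p for j, p in enumerate(clusters[c_idx]) if j != p_idx]
    if not own:
        return 0.0  # assumption: singleton clusters score 0 by convention
    # u: mean distance to the other members of the same cluster
    u = sum(math.dist(point, p) for p in own) / len(own)
    # v: smallest mean distance to the members of any other cluster
    v = min(
        sum(math.dist(point, p) for p in other) / len(other)
        for i, other in enumerate(clusters)
        if i != c_idx
    )
    return (v - u) / max(u, v)


# Hypothetical clustering: two compact, well-separated clusters.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
scores = [
    silhouette(clusters, c, p)
    for c, members in enumerate(clusters)
    for p in range(len(members))
]
mean_score = sum(scores) / len(scores)
```

Averaging the per-sample values gives one score per candidate number of clusters, so the scores for several values of k can be compared directly.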

The silhouette coefficient helps evaluate whether the clustering is reasonable, that is, whether the samples within each cluster are compact and whether good separation is maintained between different clusters. The coefficient satisfies $δ_{i} \in [- 1, 1]$. A high silhouette coefficient (close to 1) indicates that the sample fits its cluster well and that the clusters are clearly distinguished, while a low silhouette coefficient (close to −1) indicates that the sample may have been incorrectly assigned to an adjacent cluster or that the samples within the cluster are not tightly packed.

The value of K at which the silhouette coefficient reaches its maximum is taken as the initial value. The hierarchical clustering method is then adopted to merge two adjacent classes into a new class, and the center of the merged class is calculated as the average of the two cluster centers from the previous level. Adjacent clusters are merged repeatedly until (K-R) (0<=R<=K-2) clusters remain. The intra-cluster similarity of each resulting cluster is calculated, the cluster with the maximum class radius is selected for decomposition, and the sample point $x_{i 1}$ farthest from its cluster center is found. On this basis, the sample point $x_{i 2}$ farthest from $x_{i 1}$ is selected, and these two points, together with the other cluster centers, form the new set of cluster centers, which is then used for K-means clustering.^28,29 During this process, if the centroids change, the cluster with the lowest intra-cluster similarity is further selected for decomposition. If the centroids do not change, the clustering result is output.

With the hierarchical clustering method, local adjustments are made after the hierarchical structure has been determined, which greatly saves computational time. During the local adjustment of the original centers, intra-cluster similarity is applied as the evaluation criterion: the cluster with the lowest intra-cluster similarity is decomposed into two new clusters, so that clusters mistakenly merged into one are locally adjusted, the original centers are selected more reasonably, and calculation is simplified. By maximizing intra-cluster similarity, the impact of noisy data points on clustering results can be reduced, ensuring that data points in each cluster are highly correlated and avoiding unstable clustering results caused by random initialization or uneven data distribution, thereby improving the robustness of clustering.
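The merging stage can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: "adjacent" clusters are taken to be the two whose centers are closest in Euclidean distance, the merged center is the plain average of the two centers, and the split and re-clustering stage is omitted. The function name and data are hypothetical.

```python
import math
from itertools import combinations


def merge_nearest_centers(centers, target):
    """Repeatedly merge the two closest centers until `target` remain."""
    centers = [tuple(c) for c in centers]
    while len(centers) > target:
        # Assumption: "adjacent" means the pair with the smallest distance.
        i, j = min(
            combinations(range(len(centers)), 2),
            key=lambda pair: math.dist(centers[pair[0]], centers[pair[1]]),
        )
        # New center: average of the two merged centers.
        merged = tuple((a + b) / 2 for a, b in zip(centers[i], centers[j]))
        centers = [c for n, c in enumerate(centers) if n not in (i, j)]
        centers.append(merged)
    return centers


# Hypothetical centers from an initial clustering with K = 5, reduced to 3.
initial = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (1.0, 1.3), (5.0, 5.0)]
reduced = merge_nearest_centers(initial, target=3)
```

The reduced centers can then serve as starting centers for K-means in place of a purely random choice.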

Based on these characteristics, targeted training and competition strategies are developed for each cluster. By analyzing post-competition data, training plans and tactical arrangements are continuously optimized, ensuring the effectiveness and adaptability of strategies, and achieving better competition performance and higher win rates.

Basketball competition data analysis and strategy optimization practice

To evaluate the effect of data analysis and strategy optimization with the K-means clustering algorithm, this article takes participating teams from 10 university basketball leagues in a certain province as samples and takes the data of the most recent 5 competitions of each team (5 players per team) as the object for data analysis and strategy optimization practice. The practical effects are verified from three levels: clustering results, player performance, and team performance.

Experimental data

Using the publicly available official website of the university basketball leagues in the province as the data source, the competition data for the experiment is obtained through web scraping. In data collection, the Scrapy crawler starts from the homepage of the website and extracts competition links; each competition page is then accessed one by one, and the date, participating teams, and competition result of each competition are extracted. The data mainly includes basic information of athletes, game data such as scores, rebounds, and assists, game results, and game dates. Considering the timeliness of competition data, long-term and recent data are weighted differently: the weight decreases in reverse chronological order, with the latest game having the highest weight and the earliest game the lowest.
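The recency weighting can be illustrated with a short sketch. The article does not state the exact weighting scheme, so the linearly decreasing weights below are purely an assumption, as are the function names and sample scores.

```python
def recency_weights(n_games):
    """Linearly decreasing weights, newest game first, summing to 1."""
    raw = list(range(n_games, 0, -1))  # e.g. 5, 4, 3, 2, 1
    total = sum(raw)
    return [w / total for w in raw]


def weighted_average(values_newest_first):
    """Average in which more recent games count more."""
    weights = recency_weights(len(values_newest_first))
    return sum(w * v for w, v in zip(weights, values_newest_first))


weights = recency_weights(5)
points_newest_first = [90, 80, 85, 70, 75]  # hypothetical team scores
weighted_score = weighted_average(points_newest_first)
```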

The collected data is cleaned, and missing values and outliers are processed to ensure data quality. In order to eliminate dimensional effects and ensure comparability, data on different scales are standardized. Then, the competition results, player performance, and team characteristic values are extracted from the original competition data. 10-fold cross-validation is used because it provides a more stable estimate of model performance: 90% of the data is used to train the model in each cross-validation cycle, which reduces the risk of model overfitting. The data is divided into 10 groups according to sample distribution, with 9 groups forming the training group and 1 group the testing group. Using the K-means clustering algorithm, cluster analysis is conducted on the selected teams to study the competition characteristics and performance of each team within the cluster.
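The 10-group split can be sketched as follows. This simplified version shuffles the sample indices at random rather than grouping them by sample distribution, and the sample count of 50 is a hypothetical figure.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


folds = k_fold_indices(n_samples=50)  # hypothetical number of samples
for test_fold in range(len(folds)):
    # One group is held out for testing; the other nine form the training set.
    test_idx = folds[test_fold]
    train_idx = [
        i for f, fold in enumerate(folds) if f != test_fold for i in fold
    ]
    # The clustering would be fitted on train_idx and checked on test_idx here.
```

Each sample is held out exactly once, so every cycle trains on 90% of the data and tests on the remaining 10%.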

Experimental results

Cluster results

Based on the competition data obtained by all teams, the characteristics of the participating player clusters, competition results, player performance, and team characteristics are calculated through hierarchical clustering. The results are shown in Tables 5–7.

Table 5.

Characteristic results of competition results.

Cluster hierarchy	Cluster sequence	Number of teams	Win (times)	Lose (times)	The average difference between each competition and the previous competition (points)
Competition results	1	2	[2,3)	[1,4)	[3,7)
	2	3	[0,2)	[3,5)	[6,12)
	3	3	[3,5)	[0,2)	[5,10)
	4	2	[1,3)	[2,4)	[1,3)

Table 6.

Results of player performance characteristics.

Cluster hierarchy	Cluster sequence	Number of teams	Scores (points)	Rebounds (times)	Assists (times)	Steals (times)	Errors (times)	Fouls (times)
Player performance	1	3	[20,35)	[2,6)	[1,3)	[0,3)	[0,3)	[1,2)
	2	1	[15,20)	[1,4)	[6,10)	[4,7)	[1,5)	[2,6)
	3	2	[10,25)	[5,8)	[3,6)	[2,3)	[2,3)	[2,3)
	4	2	[15,30)	[3,5)	[2,5)	[0,5)	[1,4)	[2,5)
	5	2	[20,30)	[1,4)	[2,7)	[1,3)	[0,2)	[0,1)

Table 7.

Feature results of team characteristics.

Cluster hierarchy	Cluster sequence	Number of teams	Historical win rate (%)	Average score (points)	Score difference between home and away competitions (points)
Team characteristics	1	2	[40,70)	[80,90)	[10,20)
	2	4	[30,60)	[80,95)	[20,25)
	3	2	[70,85)	[95,115)	[10,25)
	4	2	[50,75)	[90,110)	[25,35)

From Table 5, it can be seen that the team in Cluster 3 has the highest winning result range, indicating that the team in this cluster has a better level of state and performance in the competition; the team in Cluster 4 has the lowest average difference interval between each competition and the previous competition, indicating that the team in this cluster has relatively ideal stability in their performance in the competition.

From Table 6, it can be seen that the players of the teams in Cluster 1 have the highest scoring range, reaching [20,35), indicating that the attacking ability of the teams in Cluster 1 is relatively ideal; the players in Cluster 2 have the highest ranges for assists and steals, reaching [6,10) and [4,7), respectively, indicating that the team in Cluster 2 has excellent cooperation and personal defensive ability; the teams in Cluster 3 have the highest number of rebounds, reaching [5,8), indicating that the teams in Cluster 3 have a higher level of offensive and defensive rebounding; the teams in Cluster 5 have the lowest numbers of errors and fouls, reaching [0,2) and [0,1), respectively, indicating that the team members in this cluster have ideal control on the field.

From Table 7, it can be seen that the team in Cluster 3 has the highest historical win rate and average score interval results, reaching [70,85) and [95, 115), respectively, indicating that the team in this cluster achieves relatively ideal results in the last 5 competitions. The team in Cluster 1 has the smallest score difference interval results in home and away competitions, with specific interval results of [10, 20), indicating that the team in this cluster is less affected by the competition environment and the overall performance level of the team is relatively stable.

To verify the clustering results, the silhouette coefficients of the clustering analysis results are calculated. The coefficient takes values in the range [−1, 1]. When the coefficient is close to −1, it indicates that the data object is significantly different from the other objects in its cluster, and the overall allocation of the dataset is poor; when the coefficient is close to 1, it indicates that the data object has little dissimilarity with the other objects in its cluster, and the overall allocation of the dataset is better. The final result is shown in Figure 5.

Figure 5.

Silhouette coefficient result. (a) Coefficient results of the competition result hierarchy. (b) Coefficient results of player performance levels. (c) Coefficient results of team characteristic levels.

Figure 5 shows that the silhouette coefficients of all three clustering levels are close to 1, indicating good overall data allocation and a good clustering effect at each level. Specifically, in Figure 5(a) the coefficient reaches 0.833 when the number of clusters is 4; in Figure 5(b) it reaches 0.862 when the number of clusters is 5; and in Figure 5(c) it reaches 0.869 when the number of clusters is 4. At each level, data points are tightly grouped within their own clusters and well separated from other clusters, so the clustering algorithm effectively captures the intrinsic structure of the data. The distribution of sample points in the competition data is therefore reasonable, with good inter-class separation, and the clustering results for competition results, player performance, and team characteristics have high credibility and interpretability.
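The way a silhouette coefficient can be compared across candidate numbers of clusters, as in Figure 5, can be sketched in a few lines of Python. This is a minimal illustration and not the article's implementation: the toy `points` are hypothetical two-dimensional data, the helper names `kmeans` and `mean_silhouette` are invented for this sketch, and the farthest-point initialisation is an assumption chosen only to make the run deterministic.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point whose nearest existing centre is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: three compact groups in a two-feature space.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]

# Score each candidate number of clusters and keep the best one.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: mean silhouette = {s:.3f}")
print("best k:", best_k)
```

On this toy data the score peaks at the number of groups that were actually planted, which mirrors the reading of Figure 5: the cluster count with the highest coefficient is taken as the most suitable partition for that level.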

Based on the clustering results, targeted strategies are developed according to the characteristics of each cluster. In training and competition, the team in the test set is used as the sample object, and the lineup configuration is adjusted based on the team’s player performance and team characteristics. For players with strong offensive skills, emphasis is placed on attacking efficiency, and training is provided in quick counterattacks and perimeter shooting. For players with strong defense, defensive intensity is emphasized, and training in rebounding competition and defensive positioning is strengthened. The effects before and after strategy optimization are compared in two respects: player performance and team performance.

Player performance

In competition, the individual abilities, technical level, and competitive state of players are crucial for the success of the team.³⁰ The optimization strategies are applied to the daily training of the team, with a training period of 8 weeks. The team is tested 5 times both before and after the training cycle, and its average performance before and after optimization is compared in terms of rebounds, assists, steals, turnovers, and fouls. The final result is shown in Figure 6.

Figure 6.

Comparison of player performance. (a) Performance of players before optimization. (b) Performance of players after optimization.

Figure 6 shows that, under the strategy optimization based on cluster analysis, the performance of players improves markedly after training. In Figure 6(a), the average numbers of rebounds, assists, steals, turnovers, and fouls of the team members before optimization are approximately 3.6, 7.2, 3.8, 1.4, and 1.2, respectively; in Figure 6(b), the corresponding averages after optimization are about 5, 9, 6.2, 0.6, and 0.4, respectively. Compared to before optimization, the average numbers of rebounds, assists, and steals of players in the team after optimization increase by about 38.9%, 25.0%, and 63.2%, respectively. This result reflects not only the improvement of the team in rebounding, assisting, and stealing but also the comprehensive improvement of the team’s overall strategy, individual player abilities, and teamwork. Through strategy optimization guided by data analysis, the team can continuously improve its performance. Coaches can build tactical systems around versatile players, as they can contribute to both offense and defense. Specialist scorers and rebounders can serve as tactical supplements, with the former focusing on scoring at critical moments and the latter on securing rebounds.
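The percentage changes quoted for Figure 6 follow from the approximate before and after averages given in the text. The short sketch below reproduces that arithmetic; the dictionary names and the helper `percent_change` are illustrative only, and the turnover and foul percentages it prints are derived here from the quoted averages rather than stated in the article.

```python
# Approximate averages as quoted in the text for Figure 6(a) and 6(b).
before = {"rebounds": 3.6, "assists": 7.2, "steals": 3.8, "turnovers": 1.4, "fouls": 1.2}
after = {"rebounds": 5.0, "assists": 9.0, "steals": 6.2, "turnovers": 0.6, "fouls": 0.4}

def percent_change(old, new):
    """Relative change from old to new, in percent (negative means a decrease)."""
    return (new - old) / old * 100.0

# Round to one decimal place, matching the precision used in the article.
changes = {metric: round(percent_change(before[metric], after[metric]), 1)
           for metric in before}

for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

For rebounds, assists, and steals this gives the 38.9%, 25.0%, and 63.2% increases reported above; for turnovers and fouls it gives decreases of roughly 57% and 67%, subject to the rounding of the averages read from the figure.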

Team performance

As a sport that heavily relies on teamwork, team performance plays a decisive role in the outcome of basketball competitions. This article compares the team performance before and after strategy optimization from two aspects: scores and the difference in scores between home and away competitions. The final results are shown in Figure 7.

Figure 7.

Comparison of team performance. (a) Team performance before optimization. (b) Team performance after optimization.

Figure 7 shows that team performance differs before and after optimization. In Figure 7(a), the average score of the team per competition and the difference in average scores between home and away competitions before optimization are 88.4 points and 16.4 points, respectively; in Figure 7(b), the corresponding values for the optimized team are 99.6 points and 8.2 points, respectively. Compared to before optimization, the average score of the team after the K-means-based clustering analysis and optimization increases by 12.7%, and the average score difference between home and away competitions decreases by 50.0%. This result indicates that, under strategy optimization, both the overall performance of the team and its adaptability to different environments have improved.
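As a quick check, the two team-level percentages follow directly from the values quoted for Figure 7, taking the change relative to the value before optimization:

```latex
\frac{99.6 - 88.4}{88.4} \times 100\% \approx 12.7\%, \qquad
\frac{8.2 - 16.4}{16.4} \times 100\% = -50.0\%
```

The negative sign on the second result corresponds to the reported 50.0% decrease in the home and away score difference.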

Discussion

To verify the effectiveness of applying the K-means clustering algorithm to sports competition data analysis and strategy optimization, this article conducts experimental analysis at three levels: clustering results, player performance, and team performance. In terms of clustering results, the silhouette coefficients of the clustering analyses of competition results, player performance, and team characteristics are all close to 1. This indicates that the algorithm classifies the competition data well and reflects the internal structure and classification characteristics of the data. In terms of player performance, after the training strategy is optimized, the numbers of rebounds, assists, and steals of the players in the team increase, while the numbers of turnovers and fouls decrease. In terms of team performance, compared to before optimization, the average score of the optimized team per competition rises and the difference in average scores between home and away competitions narrows. Supported by the K-means clustering algorithm, the overall performance of players and teams is effectively improved through analysis of competition data and optimization of training strategies.

Conclusions

The expansion of the sports industry and of event scale places higher demands on the level of athletes and competition. Traditional competition analysis and strategy formulation often rely on the subjective experience of coaches, making it difficult to analyze the performance characteristics of players and of the whole team comprehensively. To make training and competition strategy formulation more scientific and to raise the technical level of athletes, this article applied the K-means clustering algorithm to competition data analysis and strategy optimization, taking basketball as the object of study. With the support of the K-means clustering algorithm, a good classification of competition data was achieved, and on this basis the performance of players and the overall performance of the team were effectively improved, raising their competitive level. K-means clustering can help coaches and analysts identify different types and styles of athletes and customize training plans and competition strategies based on the clustering results, which both raises the scientific and competitive level of competitive sports and promotes the modernization and data-driven transformation of the sports industry. In actual sports training, the clustering results can be used to identify the strengths and most suitable role of each player and to develop personalized training plans and on-field responsibilities. Periodic data analysis can be implemented to monitor player performance and strategy effectiveness, so that training plans and game strategies can be adjusted in a timely manner. The application of cluster analysis and the K-means algorithm to game data analysis helps optimize resource allocation, provides forward-looking guidance for the long-term planning and short-term adjustments of teams, and promotes the intelligent, efficient, and fair development of sports.
Although this study can provide some guidance for the daily training of athletes and the formulation of competition strategies, it has limitations. The environment and conditions of sports competitions often change, and strategy optimization needs to adapt flexibly to these changes rather than rely only on static analysis of historical data. This article did not examine in depth how the timeliness and regionality of competition data affect the analysis results, and the universality of the algorithm in practical scenarios still needs further optimization and verification. Future research should consider expanding data sources and sample ranges and examining the impact of different factors on the analysis results, in order to promote the intelligent and healthy development of the sports industry.

Statements and declarations

Footnotes

Conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. Orunbayev. Globalization and sports industry. Am J Soc Sci Hum Res 2023; 3(11): 164–182.

2. Yang. The construction of sports culture industry growth forecast model based on big data. Personal Ubiquitous Comput 2020; 24(1): 5–17. DOI: 10.1007/s00779-019-01242-z.

3. Yuan, Yang. Research on K-value selection method of K-means clustering algorithm. Multi Sci J 2019; 2(2): 226–235. DOI: 10.3390/j2020016.

4. Xia, Lei. Research on optimization of post-competition training methods for competitive sports athletes. J Sport Psychol 2023; 32(2): 95–105.

5. Baumer, Matthews, Nguyen. Big ideas in sports analytics and statistical tools for their investigation. WIREs Computational Stats 2023; 15(6): 1612–1635. DOI: 10.1002/wics.1612.

6. Muniz, Flamand. Sports analytics for balanced team-building decisions. J Oper Res Soc 2023; 74(8): 1892–1909. DOI: 10.1080/01605682.2022.2118634.

7. Cao, Huaqing. High-dimensional multi-objective optimization strategy based on directional search in decision space and sports training data simulation. Alex Eng J 2022; 61(1): 159–173. DOI: 10.1016/j.aej.2021.04.077.

8. Torres-Ronda, Beanland, Whitehead, et al. Tracking systems in team sports: a narrative review of applications of the data and sport specific analysis. Sports Med Open 2022; 8: 15–36. DOI: 10.1186/s40798-022-00408-z.

9. Rajsp, Fister Jr. A systematic literature review of intelligent data analysis methods for smart sport training. Appl Sci 2020; 10(9): 3013–3043. DOI: 10.3390/app10093013.

10. Fujii. Data-driven analysis for understanding team sports behaviors. J Robot Mechatron 2021; 33(3): 505–514.

11. Wang, Songhui. Football tactical analysis methods and development trends based on location data in the context of big data. Journal of Shanghai Institute of Sports 2021; 45(9): 60–69. DOI: 10.16099/j.sus.2021.09.006.

12. Liu, Prasad Mahapatra, Mayuri AVR. Hybrid design for sports data visualization using AI and big data analytics. Complex Intell Syst 2023; 9(3): 2969–2980. DOI: 10.1007/s40747-021-00557-w.

13. Zhang, Wang. Leveraging statistical theory in sports competitions: An analysis of probabilistic models and multiple regression within the framework of big data. Revista multidisciplinar de las Ciencias del Deporte 2024; 24: 19–41.

14. Hita IPAD, Krisna Dewi, Indrawan IKAP, et al. Socialization of basketball game rules. Jurnal Pengabdian Dan Pemberdayaan Masyarakat Indonesia 2023; 3(1): 20–27.

15. Davide, Xavi, Andrea, et al. Match activities in basketball games: comparison between different competitive levels. J Strength Condit Res 2020; 34(1): 172–182. DOI: 10.1519/JSC.0000000000003039.

16. Zhang, Lorenzo, Zhou, et al. Performance profiles and opposition interaction during game-play in elite basketball: evidences from National Basketball Association. Int J Perform Anal Sport 2019; 19(1): 28–48. DOI: 10.1080/24748668.2018.1555738.

17. Yuliandra, Fahrizqi. Development of endurance with the ball exercise model in basketball games. JP JOK 2020; 4(1): 61–72. DOI: 10.33503/jp.jok.v4i1.980.

18. Ferioli, Rampinini, Martin, et al. Influence of ball possession and playing position on the physical demands encountered during professional basketball games. Biol Sport 2020; 37(3): 269–276. DOI: 10.5114/biolsport.2020.95638.

19. Ahmed, Seraj, Islam SMS. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020; 9(8): 1295–1306. DOI: 10.3390/electronics9081295.

20. Hossain MDZ, Nasim Akhtar, Badlishah Ahmad, et al. A dynamic K-means clustering for data mining. Indones J Electr Eng Comput Sci 2019; 13(2): 521–526. DOI: 10.11591/ijeecs.v13.i2.pp521-526.

21. Wang, Gittens, Mahoney. Scalable kernel k-means clustering with Nystrom approximation: Relative-error bounds. J Mach Learn Res 2019; 20(12): 1–49.

22. Nguyen, De Baets. Kernel-based distance metric learning for supervised k-means clustering. IEEE Transact Neural Networks Learn Syst 2019; 30(10): 3084–3095. DOI: 10.1109/TNNLS.2018.2890021.

23. Liu. Simplemkkm: Simple multiple kernel k-means. IEEE Trans Pattern Anal Mach Intell 2022; 45(4): 5174–5186. DOI: 10.1109/TPAMI.2022.3198638.

24. Shrifan NHMM, Akbar, Nor Ashidi Mat Isa. An adaptive outlier removal aided k-means clustering algorithm. J. King Saud Univ Comp & Info Sci 2022; 34(8): 6365–6376. DOI: 10.1016/j.jksuci.2021.07.003.

25. Sivaguru, Punniyamoorthy. Performance-enhanced rough k-means clustering algorithm. Soft Comput 2021; 25(2): 1595–1616. DOI: 10.1007/s00500-020-05247-2.

26. Ghazal. Performances of k-means clustering algorithm with different distance metrics. Intell Autom Soft Comput 2021; 30(2): 735–742. DOI: 10.32604/iasc.2021.019067.

27. Cohen-Addad, Kanade, Mallmann-Trenn, et al. Hierarchical clustering: Objective functions and algorithms. J ACM 2019; 66(4): 1–42. DOI: 10.1145/3321386.

28. Gao, Jacob, Daniela. Selective inference for hierarchical clustering. J Am Stat Assoc 2024; 119(545): 332–342. DOI: 10.1080/01621459.2022.2116331.

29. Yoo, Younju, Yeon, et al. On hierarchical clustering in sufficient dimension reduction. Commun Stat Appl Methods 2020; 27(4): 431–443. DOI: 10.29220/CSAM.2020.27.4.431.

30. Rangel, Gilbert, Santana, et al. Integrated evaluation of team strategy, training practices and game performance of a basketball team. Int J Sports Sci Coach 2023; 18(1): 197–206. DOI: 10.1177/17479541221076.