Outlier data mining model for sports data analysis

Abstract

The results of data mining can be used to predict the physical health status of athletes and college sports students and to provide physical fitness warnings, so that students can pay attention to their physical health and adjust their physical exercise. Discrete Morse theory, as a powerful optimization theory, plays an important role in algorithm optimization. This paper combines data mining and discrete Morse theory to propose a grid clustering algorithm based on discrete Morse theory. Moreover, according to the theorem that a cell complex is optimal when it has the fewest possible critical points, this study applies the concept of critical points in discrete Morse theory to optimize the grid clustering process and obtain clustering results. In addition, this study uses the improved C4.5 algorithm to analyze the physical fitness assessment results and obtains a valuable analysis of these results.

Keywords

Discrete data, data mining, machine learning, sports

1 Introduction

At present, sports, as an important part of the development of human civilization, have made great achievements. The Olympic spirit of “faster, higher, stronger” is deeply rooted in the hearts of the people. World events held by the individual sports federations, such as the football World Cup and the Basketball World Championship, have become shared festivals for fans all over the world. As a developing country with one-fifth of the world’s population, China has become a veritable sports power through the unremitting efforts of several generations of athletes. It has become the norm for Chinese table tennis players to win Olympic titles, the diving “dream team” well deserves its name, and the women’s volleyball team has won both the World Cup and the Olympic title, all of which has pushed Chinese sports to a whole new level. In addition, the upcoming 2022 Winter Olympics will once again focus the world’s attention on China. At present, China’s sports undertakings are at a critical stage of the transformation from a big sports country to a strong sports country [1]. However, professional basketball, the sport that attracts the most attention from the general public, has not achieved results that match the rapid development of Chinese sports. It has gradually lost its dominance in Asian basketball [2], and its gap with the world’s advanced level shows a widening trend.

With the progress and development of Internet technology and information technology, China’s research and application of data mining technology has achieved many results. However, in guiding the training and competition of professional athletes, the application of data mining technology is still at the exploratory stage. Xiangyang Xie believed that data mining technology is rarely used in sports and, in China in particular, is still at the exploratory stage. With the comprehensive informatization of China’s sports undertakings, a large amount of data has been accumulated in national fitness, professional sports, the sports industry, and sports scientific research and teaching. How to use these data effectively and discover information that is often overlooked but of great value has become an important mission for contemporary sports workers [3]. Because China’s sports field continuously accumulates large amounts of data, analyzing these massive data and mining their value with data mining (DM) methods is a critical task for sports workers. The application of data mining technology in the sports field therefore has very great prospects [4] and will play an increasingly important role. Using the research methods of sports statistics, Xinhui Zhao collated the research results and related literature on data mining in the field of sports science, and classified and summarized the documents by type, such as analysis and management, training applications, and sports scientific research. The study concluded that there is relatively little research on the establishment and application of databases in the field of sports, and that existing work focuses mostly on simple theoretical analysis. Although the current research is not deep enough, it is gratifying that data mining technology has been introduced into sports data analysis projects by some domestic experts and scholars. This development is significant for the future use of data mining technology to analyze sports data, for promoting the development of data mining technology in the field of sports, and for applying the analysis results to sports development decision-making and the orderly development of individual sports sciences.

2 Related work

The literature [5] used data mining technology to construct mining models that analyze the sports activities and competition training of college students. Moreover, it developed a scientific and feasible competition training system and used the system to generate teaching programs for daily teaching, which received favorable comments from teachers and students. With the help of data mining methods such as K-Means fast clustering, decision trees, CHAID decision trees, sequence association rules, and Bayesian networks, the literature [6] analyzed the sports consumption structure and characteristics of urban residents and built models based on the relevant data. Moreover, it distinguished the characteristics of urban residents at different sports consumption levels and discussed the relevance of their sports consumption items, which provides a theoretical basis and reference for other scholars studying the statistical regularity of urban residents’ sports consumption. The literature [7] analyzed the characteristics of tennis technical and tactical decision-making needs, used association analysis to build a mining model, and systematically studied the correlation between points scored and hitting points in tennis. It concluded that there is a certain correlation between the points scored and the hitting point in a tennis game, which provides a strong decision basis for controlling different hitting points. The literature [8] built a basketball technical and tactical analysis system centered on key actions, basketball technical characteristics, and tactical orientation with the help of association analysis, Markov-based data mining analysis, and cluster analysis. The analysis results can guide basketball training and competition and have high practical value.
The literature [9] analyzed the factors that affect the results of table tennis games and built an application model based on a Markov mining algorithm. It concluded that the key factors for winning a table tennis game are control to hold and control to attack, which is in line with the actual situation of the first world table tennis competition. The literature [10] used a data mining association rule algorithm, combined with a prototype analysis system for table tennis techniques and tactics, to write middleware for a development library. This analysis tool has become an effective means of guiding the analysis of table tennis techniques and tactics in the coal system, and it has enriched the selection of table tennis techniques and tactics. The literature [11] applied association rules to basketball games, displaying real-time game data and analyzing the scene to assist coaches, which provides a reference and theoretical basis for coaches’ deployment strategies. After predicting the tactics the opponent will use next through data mining analysis, a team can take appropriate countermeasures in advance and gain an advantage in the game. The literature [12] systematically analyzed the development status of data mining analysis systems for volleyball, formulated model design principles, and studied the functional structure of evaluation. After weighing the analysis results of an actual volleyball data mining system, the scholar proposed that a competitive volleyball data mining system has the characteristics of compound functions, data sharing, simple modules, and convenient operation.
Moreover, the system can reveal the characteristics of the second pass distribution in the volleyball game, the reasons for the card wheel, and the relationship between the points difference and the winning and losing rate of the game, and it reflects the inherent regularities of the volleyball game. In addition, developed countries in Europe and America, as sports powers, have applied data mining technology to the operation of professional sports clubs. In particular, in the clubs of the four major professional leagues in the United States, data mining has become an indispensable technical and tactical analysis tool and plays an important role in training and competition. The literature [13] divided the offensive half of the basketball court into different shooting areas and analyzed the shots taken in basketball games. Moreover, it used data mining technology to identify the areas from which athletes shoot best, thus advising coaches on how to arrange the best players on offense. The literature [14] proposed the concept of an athlete efficiency value and used it to evaluate an athlete’s efficiency per minute. The literature [15] comprehensively considered the other players on the basketball court, home and away games, and other related factors. By comparing the efficiency of the club when the athlete is on the court with its efficiency when the athlete is not on the court, it proposed a comprehensive efficiency value that evaluates the value the athlete produces for the club, and it corrected this comprehensive efficiency value. The literature [16] used data mining techniques to organize and summarize baseball statistics and proposed the “Win share” method to evaluate the contribution of athletes in the game, so as to reasonably measure the value of athletes to the club.
In summary, experts and scholars have made some explorations and studies on the application of data mining in the operation and management of professional sports clubs, but the current research has the following deficiencies: (1) The application of data mining technology in the field of sports has been concentrated in theoretical research, while applied research aimed specifically at improving the athletic ability of professional sports club athletes is clearly insufficient; (2) In applying data mining to the improvement of the athletic ability of professional sports club athletes, most existing research remains at a shallow stage such as data statistics, there are few application studies on specific programs and measures, and the practical value of the research results is only modest.

3 Discrete data mining

Association rule mining is an important technique in data mining. It analyzes the characteristics of item sets and the relationships between them to extract the relevant and potential association rules existing in large databases. Moreover, it effectively finds a positive correlation between two seemingly unrelated things and applies it to production decisions, which is the driving force behind the development of association rule mining [17].

Let the item set I = { i₁, i₂, ⋯ , i_m } be a set containing m different items. Each transaction T is a set of items such that T ⊆ I. Given an item set X ⊆ I, the transaction T contains X if and only if X ⊆ T. An association rule is an implication X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. Here, X and Y are called the antecedent and the consequent of the association rule, respectively [18].

A well-known example of association rules is that 90% of customers who bought hamburgers also bought Coke. 90% is the confidence of the association rule, which means that 90% of transactions containing X (hamburger) also contain Y (Coke). It can be seen that support and confidence are important concepts that are indispensable for describing association rules:

Support: The probability that a transaction in the transaction database D contains both X and Y, calculated by the following formula [19]: $Support (X \Rightarrow Y) = Support (X \cup Y)$ (1)

Confidence: The percentage of the transactions in D containing X that also contain Y, calculated by the following formula: $Confidence (X \Rightarrow Y) = \frac{Support (X \cup Y)}{Support (X)}$ (2)

In the formula, Support (*) represents the support degree of the item set * in the transaction database D, which is the probability of the transaction containing * in D.

From the definitions of support and confidence, it can be seen that the support of an association rule measures the weight of the association relationship between item sets, while the confidence of an association rule measures the degree of association between item sets [20]. The process of mining association rules is to find all association rules whose support is not less than the minimum support threshold and whose confidence is not less than the minimum confidence threshold, that is, the association rules that meet the following conditions: $\begin{matrix} Support (X \Rightarrow Y) ⩾ min\_support \\ Confidence (X \Rightarrow Y) ⩾ min\_confidence \end{matrix}$ (3)
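As an illustration of formulas (1) and (2), the following minimal Python sketch computes support and confidence over a small transaction database. The function names and the toy transactions are hypothetical and are not taken from the data used in this study.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Support(X ∪ Y) / Support(X), as in formula (2)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Toy transaction database D (illustrative only).
transactions = [
    frozenset({"hamburger", "coke"}),
    frozenset({"hamburger", "coke", "fries"}),
    frozenset({"hamburger"}),
    frozenset({"coke"}),
]

sup = support({"hamburger", "coke"}, transactions)         # 2 of 4 transactions
conf = confidence({"hamburger"}, {"coke"}, transactions)   # 2 of 3 hamburger transactions
```

A rule X ⇒ Y would be kept only if both values reach the chosen minimum thresholds of condition (3).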

Other concepts related to association rule mining:

Candidate n-item set: If the item set X contains n items, X is called a candidate n-item set, and the collection of candidate n-item sets is recorded as C_n [21];

Frequent n-item set: An n-item set whose support is not less than the minimum support threshold is called a frequent n-item set, and the collection of frequent n-item sets is denoted as L_n.
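The relationship between candidate sets C_n and frequent sets L_n can be sketched with a level-wise (Apriori-style) search. This is a minimal sketch under stated assumptions: the function name, the toy transactions, and the use of the "not less than" threshold of condition (3) are choices made for illustration, not the procedure of this study.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {n: L_n}, the frequent n-item sets found level by level."""
    n_tx = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # candidate 1-item sets C_1
    levels = {}
    n = 1
    while current:
        frequent = {
            c for c in current
            if sum(1 for t in transactions if c <= t) / n_tx >= min_support
        }
        if not frequent:
            break
        levels[n] = frequent
        n += 1
        # C_n: unions of two frequent (n-1)-item sets that contain n items ...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == n}
        # ... and whose (n-1)-item subsets are all frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, n - 1))
        ]
    return levels


transactions = [
    frozenset("abc"), frozenset("ab"), frozenset("ac"),
    frozenset("bc"), frozenset("abc"),
]
levels = frequent_itemsets(transactions, min_support=0.5)
```

In this toy database every pair of items is frequent, while the triple {a, b, c} appears in only two of five transactions and is discarded.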

Topological equivalence: If there is a continuous bijection f : X → Y whose inverse f⁻¹ : Y → X is also continuous, the spaces X and Y are said to be topologically equivalent or homeomorphic.

Cell: A p-dimensional cell α is a space homeomorphic to the open ball B = { x ∈ R^p : ∥ x ∥ < 1 } of dimension p. Such a cell is denoted as α^(p).

CW-complex: A CW-complex is any topological space X for which there is a finite nested sequence ∅ ⊂ X₀ ⊂ X₁ ⊂ ⋯ ⊂ X_n = X, where, for i = 1, 2, ⋯ , n, X_i is obtained by attaching a cell to X_(i-1). Here, X₀ is a 0-dimensional cell (vertex), and the attaching operation requires that the entire boundary of the attached cell be glued to X_(i-1). For example, X ∪ _fσ represents the result of attaching the cell σ to X, where f denotes the map that identifies the boundary of σ with points of X.

For example, a torus can be regarded as a cell complex built from 4 cells, as shown in Fig. 1.

Hasse diagram: The Hasse diagram of a cell complex K is a directed pseudograph H such that:

Each node of H represents a cell of K;

Each link of H connects two incident cells of K whose dimensions differ by one, and the source node of each link is the cell of higher dimension.
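The two conditions above can be sketched for a filled triangle. This is a minimal illustration in which cells are represented as sorted tuples of vertex names; the function name and this representation are assumptions made for the example.

```python
from itertools import combinations


def hasse_diagram(top_cells):
    """Nodes and directed links of the Hasse diagram of a simplicial complex.

    The complex is generated by `top_cells`; each link goes from a cell to
    one of its faces of dimension one lower.
    """
    cells = set()
    for top in top_cells:
        top = tuple(sorted(top))
        for k in range(1, len(top) + 1):
            cells.update(combinations(top, k))
    links = sorted(
        (c, face)
        for c in cells if len(c) > 1
        for face in combinations(c, len(c) - 1)
    )
    return cells, links


# Filled triangle with vertices a, b, c.
cells, links = hasse_diagram([("a", "b", "c")])
```

The triangle yields 7 nodes (3 vertices, 3 edges, 1 face) and 9 links (3 from the face to its edges, 6 from the edges to their vertices).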

Figure 2 shows a Hasse diagram of a cell complex K (triangle).

Fig. 1

Construction of a torus containing 4 cells.

Fig. 2

Hasse diagram of cell complex K.

Manifold: A manifold is a topological space in which every point has a neighborhood homeomorphic to Rⁿ or R₊ × Rⁿ⁻¹.

Homology groups: For each p, the p-dimensional homology group $H_{p} = Ker \partial_{p} / Im \partial_{p + 1}$ is obtained by identifying two p-cycles that differ only by a p-boundary: $\begin{matrix} \forall z^{(p)}, t^{(p)} \in Ker \partial_{p}, z^{(p)} \equiv t^{(p)} \Leftrightarrow \\ z^{(p)} - t^{(p)} \in Im \partial_{p + 1} \end{matrix}$ (4)

These homology groups are commutative and finitely generated (cell complexes are finite), so they can be written as $H_{p} = Z_{2}^{β_{p}}$. Here, β_p is the p-th Betti number with coefficients in Z₂.

On a given cell complex, a discrete Morse function is a real-valued function on the cells. It assigns larger values to higher-dimensional cells, with at most one exception around each cell. Its precise definition is that, for every cell α^(p): $\begin{matrix} \# \{ τ^{(p + 1)} ≻ α^{(p)} : f (τ) ⩽ f (α) \} ⩽ 1 \text{ and} \\ \# \{ v^{(p - 1)} ≺ α^{(p)} : f (v) ⩾ f (α) \} ⩽ 1 \end{matrix}$ (5)

In other words, the cardinality of these two expressions is at most 1.
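Condition (5) can be checked mechanically. The following minimal sketch represents cells as tuples of vertex names and a function f as a dictionary from cells to values. The helper name and the two example functions on the boundary of a triangle are hypothetical and do not reproduce Fig. 3.

```python
def is_discrete_morse(f):
    """Check condition (5) for a function f given as {cell: value}."""
    cells = list(f)
    for a in cells:
        # Cofaces of dimension +1 whose value is not larger than f(a).
        up = sum(
            1 for t in cells
            if len(t) == len(a) + 1 and set(a) < set(t) and f[t] <= f[a]
        )
        # Faces of dimension -1 whose value is not smaller than f(a).
        down = sum(
            1 for v in cells
            if len(v) == len(a) - 1 and set(v) < set(a) and f[v] >= f[a]
        )
        if up > 1 or down > 1:
            return False
    return True


# Boundary of a triangle: three vertices and three edges.
f_good = {
    ("a",): 0, ("b",): 2, ("c",): 4,
    ("a", "b"): 1, ("b", "c"): 3, ("a", "c"): 5,
}
# Same complex, but edge ab now lies below both of its vertices.
f_bad = dict(f_good)
f_bad[("a", "b")] = -1

good = is_discrete_morse(f_good)
bad = is_discrete_morse(f_bad)
```

In `f_bad` the edge ab has two vertices with values not smaller than its own, which violates the second condition of (5).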

The discrete Morse function assigns a single value to each cell of the cell complex K instead of being a continuous function on K. The definition gives a basic principle for judging whether a function is a discrete Morse function: for each cell, at most one adjacent cell of one dimension higher has a function value less than or equal to that of the cell, and at most one adjacent cell of one dimension lower has a function value greater than or equal to that of the cell. Figure 3 shows an example of a discrete Morse function.

Fig. 3

The function f defined on the cell complex K.

According to the definition of the discrete Morse function, the function f represented by (i) in Fig. 3 is not a discrete Morse function, and the function represented by (ii) is a discrete Morse function. The reason is that the edge f⁻¹(0) in (i) has two vertices, which are of lower dimension, but both vertices have larger function values, which does not satisfy the second condition in the definition. Similarly, the vertex f⁻¹(5) has lower dimension than its two adjacent edges, but its function value is larger than that of both edges, which violates the first condition in the definition. In contrast, the distribution of all function values in (ii) satisfies both conditions of the discrete Morse function.

An important concept for discrete Morse functions is the critical cell. If the function value of a cell is larger than the function values of all its lower-dimensional faces and smaller than the function values of all higher-dimensional cells adjacent to it, then the cell is called a critical cell. The specific definition is as follows:

Critical cell: A p-dimensional cell α^(p) is critical if the following conditions are true: $\begin{matrix} \# \{ τ^{(p + 1)} ≻ α^{(p)} : f (τ) ⩽ f (α) \} = 0 \text{ and} \\ \# \{ v^{(p - 1)} ≺ α^{(p)} : f (v) ⩾ f (α) \} = 0 \end{matrix}$ (6)

If K is a simplicial complex with the Morse function f, then for any cell α the following holds: $\begin{matrix} \# \{ τ^{(p + 1)} ≻ α^{(p)} : f (τ) ⩽ f (α) \} = 0 \text{ or} \\ \# \{ v^{(p - 1)} ≺ α^{(p)} : f (v) ⩾ f (α) \} = 0 \end{matrix}$ (7)

The discrete Morse function shows the topological information of the cell complex, and the Morse inequality proved by Morse clarifies more information about the topological properties. Morse inequalities include strong Morse inequalities and weak Morse inequalities.

Strong Morse inequalities: For a given finite cell complex K, any discrete Morse function f defined on it satisfies: $\begin{matrix} \forall p, m_{p} (f) - m_{p - 1} (f) + \dots \pm m_{0} (f) ⩾ \\ β_{p} (K) - β_{p - 1} (K) + \dots \pm β_{0} (K) \end{matrix}$ (8)

Here, m_p (f) represents the number of p-dimensional critical cells of the discrete Morse function f, and β_p (K) represents the p-th Betti number of the cell complex K. The strong Morse inequality states that the alternating sum of the numbers of critical cells up to dimension p is greater than or equal to the corresponding alternating sum of the Betti numbers. The Euler characteristic is a topological invariant of the cell complex that is given by such an alternating sum. For a two-dimensional polyhedron, the Euler characteristic = the number of vertices − the number of edges + the number of faces.

Weak Morse inequalities: For a given finite cell complex K of dimension n, any discrete Morse function f defined on it satisfies: $\begin{matrix} \forall p, m_{p} (f) ⩾ β_{p} (K) \\ χ (K) = \#_{n} (K) - \#_{n - 1} (K) + \dots \pm \#_{0} (K) \\ = m_{n} (f) - m_{n - 1} (f) + \dots \pm m_{0} (f) \\ = β_{n} (K) - β_{n - 1} (K) + \dots \pm β_{0} (K) \end{matrix}$ (9)

Here, χ (K) is the Euler characteristic and \#_p (K) is the number of p-dimensional cells of K.

The weak Morse inequality shows that the Euler characteristic of the cell complex K can be expressed by the numbers of critical cells, by the Betti numbers, and by the numbers of i-dimensional cells in K. Compared with the weak Morse inequality, the strong Morse inequality gives the topological information of the cell complex K more strictly.
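The statements in (6) and (9) can be illustrated on a small example. The sketch below counts the critical cells of a hypothetical discrete Morse function on the boundary of a triangle (a circle) and compares them with the cell counts and with the Betti numbers β₀ = β₁ = 1 of a circle, which are supplied by hand rather than computed. All names are assumptions made for this illustration.

```python
def critical_cells(f):
    """Cells satisfying condition (6) for f given as {cell: value}."""
    cells = list(f)
    result = []
    for a in cells:
        up = [t for t in cells
              if len(t) == len(a) + 1 and set(a) < set(t) and f[t] <= f[a]]
        down = [v for v in cells
                if len(v) == len(a) - 1 and set(v) < set(a) and f[v] >= f[a]]
        if not up and not down:
            result.append(a)
    return result


def counts_by_dim(cells):
    """Number of cells in each dimension (a cell with k vertices has dimension k - 1)."""
    out = {}
    for c in cells:
        out[len(c) - 1] = out.get(len(c) - 1, 0) + 1
    return out


def alternating_sum(counts):
    return sum((-1) ** p * n for p, n in counts.items())


# Hypothetical discrete Morse function on the boundary of a triangle.
f = {
    ("a",): 0, ("b",): 2, ("c",): 4,
    ("a", "b"): 1, ("b", "c"): 3, ("a", "c"): 5,
}
m = counts_by_dim(critical_cells(f))   # critical cells per dimension
num_cells = counts_by_dim(f)           # all cells per dimension
betti = {0: 1, 1: 1}                   # Betti numbers of a circle (given, not computed)

weak_ok = all(m.get(p, 0) >= b for p, b in betti.items())
chi = (alternating_sum(num_cells), alternating_sum(m), alternating_sum(betti))
```

Here the only critical cells are the vertex a and the edge ac, so m₀ = m₁ = 1, and all three alternating sums agree.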

4 Information gain rate

Information gain rate is an important concept in the C4.5 algorithm. When building a decision tree model, attribute selection is performed by calculating the information gain rate of each attribute. We denote the data set by D and the number of samples in D by d. The data set D has m different classes, which are labeled C_i (i = 1, 2, ..., m). Therefore, the amount of classification information can be expressed by the following formula: $Info (D) = - \sum_{i = 1}^{m} p_{i} {log}_{2} p_{i}$ (10)

The proportion of class C_i is represented by p_i, and p_i can be calculated by c_i/d, where c_i is the number of samples in class C_i. The reason for using a base-2 logarithm in this study is that the information is encoded in binary.
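As a small illustration, the classification information of a list of class labels can be computed as below. The sketch uses the standard Shannon form with a leading minus sign, so that the result is non-negative; the function name and the labels are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Shannon entropy, in bits, of the class distribution of `labels`."""
    d = len(labels)
    return -sum((c / d) * math.log2(c / d) for c in Counter(labels).values())


balanced = info(["pass", "pass", "fail", "fail"])   # two equally likely classes
pure = info(["pass", "pass", "pass", "pass"])       # a single class
```

A balanced two-class set carries one bit of classification information, while a set containing a single class carries none.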

We assume that one of the attributes is denoted as A, that A has v different values, and that these values can be expressed as {a₁, a₂, …, a_v}. Therefore, the data set D can be divided by attribute A into v different subsets, which are denoted as {D₁, D₂, …, D_v}. Here, D_j (j = 1, 2, …, v) is the set of samples with the value a_j on attribute A, and d_j is the number of samples in D_j. We assume that c_ij represents the number of samples belonging to category C_i in subset D_j. The information entropy of attribute A is calculated as follows: $Info_{A} (D) = \sum_{j = 1}^{v} \frac{c_{1 j} + c_{2 j} + \dots + c_{m j}}{d} Info (c_{1 j}, c_{2 j}, \dots, c_{m j})$ (11)

In the above formula, $\frac{c_{1 j} + c_{2 j} + \dots + c_{m j}}{d}$ represents the ratio of the number of samples whose value of attribute A is a_j (j = 1, 2, …, v) to the total number of samples. The amount of information of D_j is calculated as follows: $Info (c_{1 j}, c_{2 j}, \dots, c_{m j}) = - \sum_{i = 1}^{m} p_{ij} {log}_{2} (p_{ij})$ (12)

In the above formula, $p_{ij} = \frac{c_{ij}}{d_{j}}$ indicates the proportion of the samples in D_j that belong to category C_i.

Therefore, the information gain of attribute A can be calculated by the following formula: $Gain (A) = Info (D) - {Info}_{A} (D)$ (13)

The split information SplitInfo_A (D) of the attribute A is calculated as follows: $SplitInfo_{A} (D) = - \sum_{j = 1}^{v} p_{j} {log}_{2} p_{j}$ (14)

In the above formula, p_j represents the proportion of the data whose attribute A is a_j in the entire data set, which can be calculated by d_j/d. Therefore, the information gain rate formula of the attribute A can be obtained, as shown in the following formula: $GainRatio (A) = \frac{Info (D) - {Info}_{A} (D)}{SplitInfo_{A} (D)}$ (15)

The strategy of the C4.5 decision tree classification algorithm is to calculate the information gain rate of all test attributes in the candidate data set, and use the attribute of the maximum information gain rate as the current division attribute, and finally complete the construction of the decision tree by iterating the above process.
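The attribute selection step described above can be sketched as follows. This is a minimal illustration of the standard C4.5 gain ratio for categorical attributes, with entropies taken with a leading minus sign. The function names, attribute names, and toy samples are hypothetical, and returning 0 when the split information is 0 is an assumption made to avoid division by zero.

```python
import math
from collections import Counter, defaultdict


def info(labels):
    """Shannon entropy, in bits, of a list of labels."""
    d = len(labels)
    return -sum((c / d) * math.log2(c / d) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain(A) / SplitInfo_A(D) for one attribute column `values`."""
    d = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    info_a = sum(len(g) / d * info(g) for g in groups.values())
    split_info = info(values)  # entropy of the attribute's own value distribution
    if split_info == 0:
        return 0.0  # assumption: an attribute with a single value is never chosen
    return (info(labels) - info_a) / split_info


labels = ["pass", "pass", "fail", "fail"]
attributes = {
    "grip_level": ["high", "high", "low", "low"],   # separates the classes perfectly
    "group": ["x", "y", "x", "y"],                  # carries no class information
}
ratios = {name: gain_ratio(col, labels) for name, col in attributes.items()}
best = max(ratios, key=ratios.get)
```

The attribute with the largest gain ratio is taken as the current division attribute, and the same selection is repeated on each resulting subset.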

The improved estimation process of the information gain rate is as follows:

For a two-dimensional random variable (X, Y), if the expectation $E \{ [X - E (X)] [Y - E (Y)] \}$ (16)

exists, then it is called the covariance of the random variable X and the random variable Y and is written as: $Cov (X, Y) = E \{ [X - E (X)] [Y - E (Y)] \}$ (17)

If the two-dimensional random variable (X, Y) is a discrete random variable, its probability distribution is $P \{ X = x_{i}, Y = y_{j} \} = p_{ij} (i, j = 1, 2, \dots)$ (18)

Then, the covariance of the random variable X and the random variable Y is:

$Cov (X, Y) = \sum_{i} \sum_{j} [x_{i} - E (X)] [y_{j} - E (Y)] p_{ij}$ (19)

If the two-dimensional random variable (X, Y) is a continuous random variable, its joint probability density is: $f (x, y)$ (20)

Then, the covariance of the random variable X and the random variable Y is:

$Cov (X, Y) = \int_{- \infty}^{+ \infty} \int_{- \infty}^{+ \infty} [x - E (X)] [y - E (Y)] f (x, y) dx dy$ (21)

Using the linearity of mathematical expectation, the covariance can be simplified as follows: $\begin{matrix} Cov (X, Y) = E \{ [X - E (X)] [Y - E (Y)] \} \\ = E (XY) - E (X) E (Y) - E (Y) E (X) + E (X) E (Y) \\ = E (XY) - E (X) E (Y) \end{matrix}$ (22)

The relationship between the variance of the random variables and their covariance is: $D (X + Y) = D (X) + D (Y) + 2 Cov (X, Y)$ (23)
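The simplification Cov(X, Y) = E(XY) − E(X)E(Y) and the variance relation D(X + Y) = D(X) + D(Y) + 2Cov(X, Y) can be checked exactly on a small discrete distribution. The joint probabilities below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from fractions import Fraction as Fr

# Hypothetical joint distribution p_ij of two binary random variables.
joint = {(0, 0): Fr(3, 8), (0, 1): Fr(1, 8), (1, 0): Fr(1, 8), (1, 1): Fr(3, 8)}


def expect(g):
    """Expectation of g(X, Y) under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)

cov_def = expect(lambda x, y: (x - ex) * (y - ey))    # definition of covariance
cov_short = expect(lambda x, y: x * y) - ex * ey      # simplified form

var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
var_sum = expect(lambda x, y: (x + y - ex - ey) ** 2)  # D(X + Y)
```

Exact fractions are used so that both identities hold without rounding error.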

In particular, when the random variable X and the random variable Y are independent of each other, D (X + Y) = D (X) + D (Y). The quantity $ρ_{XY} = \frac{Cov (X, Y)}{\sqrt{D (X) D (Y)}}$ is called the correlation coefficient between the random variable X and the random variable Y.

Where no confusion can arise, ρ_XY can also be simply written as ρ. In particular, when ρ_XY = 0, the random variable X is said to be uncorrelated with the random variable Y.

The C4.5 algorithm only considers the relevance of the class attribute to each test attribute but does not consider the correlation among the test attributes themselves. When the correlation between two attributes is strong, there is a high degree of redundancy between them. The sum of the correlation coefficients of one test attribute and all other test attributes is calculated as follows: $ρ = \sum_{f \in F} \frac{Cov (A, f)}{\sqrt{D (A) D (f)}}$ (24)

ρ is the sum of the correlation coefficients of test attribute A and all other test attributes, and represents the correlation between test attribute A and all other test attributes, that is, the redundancy between test attribute A and all other attributes. F denotes the set of all test attributes except the test attribute A, and f denotes an element of F, namely f ∈ F. The following formula represents the average correlation coefficient between test attribute A and the other test attributes, where n is the number of attributes over which the average is taken: $\bar{ρ} = \frac{\sum_{f \in F} \frac{Cov (A, f)}{\sqrt{D (A) D (f)}}}{n}$ (25)

Since the criterion for selecting test attributes in the C4.5 algorithm is the information gain rate, in order to balance the impact of the other test attributes on a test attribute, this paper introduces the correlation coefficient between the test attribute A and the other test attributes on the basis of the original information gain rate. When calculating the information gain rate for the test attribute A, the calculation formula of the improved information gain rate is as follows: $GainRatio (A) = \frac{1}{ρ} \frac{Gain (A)}{SplitInfo_{A} (D)}$ (26)

It can be seen from the above formula that the lower the correlation between the test attribute A and other test attributes, that is, the smaller the redundancy, the smaller the ρ, and the larger the information gain rate calculated. It can well solve the problem of high redundancy between attributes.
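The redundancy weighting can be sketched as follows. The sketch computes the Pearson correlation coefficient of a numeric attribute A with each other attribute, averages the coefficients, and divides a given gain ratio by the result. The attribute values, the input gain ratio of 0.45, and the choice to reject a non-positive ρ are assumptions made for illustration; the source does not state how such cases are handled.

```python
import math


def pearson(a, b):
    """Correlation coefficient Cov(a, b) / sqrt(D(a) D(b)) of two samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / math.sqrt(va * vb)


def mean_correlation(target, others):
    """Average correlation of `target` with every attribute in `others`."""
    return sum(pearson(target, o) for o in others) / len(others)


def improved_gain_ratio(gain_ratio, rho):
    """Gain ratio divided by the redundancy measure rho."""
    if rho <= 0:
        # Assumption: the weighting is only meaningful for positive rho.
        raise ValueError("rho must be positive")
    return gain_ratio / rho


attr_a = [1.0, 2.0, 3.0, 4.0]
others = [[2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 4.0]]
rho = mean_correlation(attr_a, others)       # (1.0 + 0.8) / 2
weighted = improved_gain_ratio(0.45, rho)    # 0.45 is a hypothetical gain ratio
```

With the same gain ratio, an attribute that is less correlated with the others has a smaller ρ and therefore a larger weighted value.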

The improved algorithm is simplified and the final result is:

$\begin{matrix} Gain Ratio (A) \\ = \frac{1}{ρ} \frac{\sum_{i = 1}^{m} c_{i} (c_{i} - d) - d \sum_{j = 1}^{v} \sum_{i = 1}^{m} \frac{c_{ij} (c_{ij} - d_{j})}{d_{j}}}{\sum_{j = 1}^{v} d_{j} (d_{j} - d)} \end{matrix}$ (27)

It can be seen from the above formula that there is no longer a logarithm operation in the calculation of the information gain rate of the improved C4.5 algorithm, and it is replaced by the addition operation, subtraction operation and division operation that are relatively easy to deal with by the computer, so the calculation efficiency of the C4.5 algorithm is greatly improved. At the same time, because the improved C4.5 algorithm not only considers the multi-value bias problem when calculating the information gain, but also considers the redundancy between attributes, the improved C4.5 algorithm can select a more reasonable attribute value.
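The logarithm-free expression in formula (27) can be evaluated directly from a table of counts. The sketch below computes the fraction in (27) without the 1/ρ factor. The function name and the two count tables are hypothetical. The form of (27) appears consistent with replacing ln x by its first-order approximation x − 1 in formulas (10) to (15); this reading is an assumption, since the derivation is not given here.

```python
def log_free_gain_ratio(counts):
    """Fraction of formula (27), without the 1/rho factor.

    counts[i][j] is c_ij, the number of samples of class C_i whose
    attribute A takes the value a_j. Every column must be non-empty and
    the attribute must take at least two values.
    """
    m, v = len(counts), len(counts[0])
    c = [sum(row) for row in counts]                                   # c_i
    dj = [sum(counts[i][j] for i in range(m)) for j in range(v)]       # d_j
    d = sum(c)                                                         # total samples
    numerator = sum(ci * (ci - d) for ci in c) - d * sum(
        counts[i][j] * (counts[i][j] - dj[j]) / dj[j]
        for j in range(v) for i in range(m)
    )
    denominator = sum(x * (x - d) for x in dj)
    return numerator / denominator


perfect_split = log_free_gain_ratio([[2, 0], [0, 2]])   # attribute separates the classes
no_information = log_free_gain_ratio([[1, 1], [1, 1]])  # attribute independent of class
```

Only additions, subtractions, multiplications, and divisions are needed, which is the source of the efficiency gain described above.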

The process of constructing a decision tree with the improved C4.5 algorithm is shown in Fig. 4 and proceeds as follows: 1) Data preprocessing includes discretizing continuous attributes, filling in missing values, and related steps, and formula (25) is used to calculate the average redundancy between each attribute and the other attributes. 2) The algorithm repeatedly selects test attributes, divides the training set, generates and marks nodes, adds the nodes to the decision tree, and finally outputs the result. 3) When all the instances in a divided subset meet one of the following test conditions: all belong to the same class, all have the same attribute value, or the number of instances is less than a certain threshold, the node is identified as a leaf node. Otherwise, the node is a non-leaf node and is labeled with the selected test attribute. 4) According to formula (27), the information gain rate of each test attribute is calculated and the best test attribute is selected. The algorithm is then called recursively until all nodes have been added to the decision tree.

Fig. 4

C4.5 algorithm improvement process.

5 Experimental analysis

Data cleaning means that some special values in the data file that can affect the results of statistical analysis are processed. Therefore, the work of data cleaning includes: filling in missing data; processing noisy data to make it as smooth as possible; handling outliers; and solving inconsistencies. In the database of this research system, due to the large amount of data and the problem of data quality, some attributes in the data will have null values, and some attribute values will be incorrect. These problems can be solved by data cleaning technology. The pre-processed data obtained by the tester is shown in Tables 1 and 2.
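One simple cleaning rule of the kind described above can be sketched as follows: entries that are missing or fall outside a plausible range are replaced by the mean of the remaining valid entries. The function name, the range limits, and the sample values are hypothetical; the actual cleaning rules used to produce Tables 1 and 2 are not specified here.

```python
def clean_column(values, lower, upper):
    """Replace missing (None) or out-of-range entries with the mean of valid ones."""
    def is_valid(v):
        return v is not None and lower <= v <= upper

    valid = [v for v in values if is_valid(v)]
    if not valid:
        raise ValueError("no valid entries to estimate a fill value from")
    mean = sum(valid) / len(valid)
    return [v if is_valid(v) else mean for v in values]


# Hypothetical grip readings: one missing value and one implausible value.
raw = [57.5, None, 44.5, 999.0]
cleaned = clean_column(raw, lower=0.0, upper=100.0)
```

Other fill strategies, such as the median or a class-wise mean, fit the same interface.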

Table 1
Physical examination data of students after data cleaning (male)

Height	Body weight	Vital capacity	Step test	Grip	Standing long jump
177.56	99.49	4637.92	65.65	57.47	2.94
186.65	56.96	3426.93	97.97	49.09	2.40
181.40	54.24	4525.81	94.94	44.64	3.01
173.42	64.44	3391.58	70.70	34.04	2.54
173.01	67.87	3995.56	76.76	46.46	2.55
170.79	62.22	4841.94	87.87	46.97	2.25
188.47	88.78	3926.88	67.67	55.25	2.46
174.53	94.23	3513.79	63.63	34.44	2.62
195.13	74.64	4862.14	50.50	48.78	3.00
184.12	74.64	4236.95	50.50	56.96	2.14
181.70	97.57	4514.70	79.79	55.55	2.86
188.37	81.31	4793.46	53.53	46.26	2.58
172.71	90.80	4339.97	92.92	45.55	2.67
183.42	77.57	4794.47	54.54	55.75	2.70

Table 2

Physical examination data of students after data cleaning (female)

Height	Body weight	Vital capacity	Step test	Grip	Standing long jump
163.62	72.72	3523.89	50.50	6.46	1.68
152.41	71.21	3116.86	66.66	16.87	1.55
155.54	52.92	1812.95	49.49	14.75	2.04
174.53	76.05	1668.52	68.68	6.87	1.44
154.03	51.91	3072.42	47.47	25.55	1.64
155.14	47.57	2776.49	66.66	35.96	1.69
172.61	67.37	1812.95	65.65	30.30	1.93
157.46	74.74	2276.54	70.70	24.04	1.91
156.55	77.37	2852.24	49.49	–2.42	2.12
167.26	50.80	2963.34	51.51	3.84	1.69
167.86	75.55	1773.56	49.49	31.92	1.24
165.24	75.35	3432.99	51.51	20.40	1.32
162.51	56.66	2867.39	45.45	–4.14	1.95
172.61	56.36	3512.78	49.49	31.71	1.34
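
The cleaning steps described before Tables 1 and 2 (filling in missing data and handling implausible values) can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the plausible ranges, the median imputation rule, and the names `PLAUSIBLE`, `clean_column`, and `clean_records` are all assumptions introduced for illustration, and the sample records are made up rather than taken from the tables.

```python
from statistics import median

# Hypothetical plausible ranges per attribute (assumptions, not from the paper).
PLAUSIBLE = {
    "height": (120.0, 220.0),  # cm
    "grip": (0.0, 100.0),      # kg; a negative grip reading is treated as invalid
}

def clean_column(values, low, high):
    """Replace missing (None) or out-of-range entries with the median of the valid ones."""
    valid = [v for v in values if v is not None and low <= v <= high]
    if not valid:
        raise ValueError("no valid values to impute from")
    fill = median(valid)
    return [v if (v is not None and low <= v <= high) else fill for v in values]

def clean_records(records, ranges=PLAUSIBLE):
    """Clean a list of dict records column by column and return new dicts."""
    cleaned = [dict(r) for r in records]  # copy so the input is not mutated
    for name, (low, high) in ranges.items():
        column = clean_column([r.get(name) for r in records], low, high)
        for row, value in zip(cleaned, column):
            row[name] = value
    return cleaned

# Illustrative records only (not rows from Tables 1 or 2).
sample = [
    {"height": 160.0, "grip": 20.0},
    {"height": None, "grip": 30.0},    # missing height
    {"height": 170.0, "grip": -2.0},   # physically implausible grip value
]
cleaned = clean_records(sample)
```

Median imputation is chosen here only because it is robust to the outliers being removed; smoothing of noisy values and consistency checks would need rules specific to each test item.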

The final score of each student is obtained according to the evaluation of the machine learning method in this article. The core of data conversion is normalization: all continuous attributes, that is, the students' physical health assessment data, are discretized. The results are shown in Tables 3 and 4, and the statistical diagrams are shown in Figs. 5 and 6.

Table 3

Student physical examination score (male)

Height and weight	Vital capacity	Step test	Grip	Standing long jump	Total score
50.5	10.1	87.9	60.6	100.0	70.7
50.5	63.6	100.0	90.9	75.8	86.9
50.5	99.0	100.0	84.8	100.0	97.0
100.0	40.4	90.9	40.4	81.8	74.7
100.0	63.6	94.9	69.7	84.8	86.9
100.0	87.9	100.0	78.8	63.6	91.9
50.5	10.1	90.9	63.6	78.8	67.7
50.5	10.1	84.8	0.0	90.9	55.6
100.0	72.7	66.7	69.7	100.0	83.8
100.0	60.6	66.7	78.8	50.5	72.7
50.5	10.1	97.0	60.6	100.0	73.7
60.6	63.6	75.8	60.6	84.8	75.8
50.5	10.1	100.0	30.3	97.0	67.7
60.6	10.1	87.9	60.6	100.0	70.7

Table 4

Student physical examination score (female)

Height and weight	Vital capacity	Step test	Grip	Standing long jump	Total score
50.5	63.6	75.8	63.6	63.6	70.7
50.5	60.6	92.9	84.8	40.4	74.7
100.0	10.1	75.8	78.8	92.9	73.7
50.5	10.1	92.9	66.7	10.1	55.6
100.0	81.8	69.7	100.0	60.6	84.8
60.6	78.8	92.9	100.0	63.6	87.9
100.0	10.1	92.9	100.0	81.8	81.8
50.5	10.1	94.9	100.0	81.8	76.8
50.5	10.1	75.8	0.0	100.0	55.6
60.6	78.8	75.8	60.6	63.6	74.7
50.5	10.1	75.8	100.0	0.0	55.6
50.5	60.6	75.8	94.9	0.0	63.6
100.0	66.7	63.6	0.0	84.8	64.6
60.6	84.8	75.8	100.0	0.0	70.7

Fig. 5

Statistical diagram of students’ physical scores (male).

Fig. 6

Statistical diagram of students’ physical scores (female).
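
The data conversion step (normalization followed by discretization of continuous attributes) can be sketched as follows. This is only an illustration under stated assumptions: the cut points 60, 75, and 90, the grade labels, and the function names `min_max_normalize` and `discretize` are hypothetical, since the thresholds used in this study are not listed here. The four scores are the first four total scores in Table 3.

```python
from bisect import bisect_right

# Hypothetical cut points and grade labels (assumptions, not from the paper).
CUTS = [60.0, 75.0, 90.0]
LABELS = ["fail", "pass", "good", "excellent"]

def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def discretize(score, cuts=CUTS, labels=LABELS):
    """Return the label of the interval containing score.

    Intervals are closed on the left, so a score of exactly 60 is a 'pass'.
    """
    return labels[bisect_right(cuts, score)]

# First four total scores from Table 3 (male students).
total_scores = [70.7, 86.9, 97.0, 74.7]
grades = [discretize(s) for s in total_scores]
```

With these assumed thresholds the four scores map to pass, good, excellent, and pass. Discrete grades of this kind are what a decision tree algorithm such as C4.5 can then use as categorical attribute values.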

The data source is the physical examination results of a college sports major. Data preparation is completed in five steps: data collection, data preprocessing, data integration, data cleaning, and data conversion. The prepared data are then used to construct decision trees. Since the data include both male and female physical fitness evaluation data, two decision trees are generated, one for male students and one for female students. Analysis of the two decision trees shows that the most influential factor for male students is vital capacity, whereas for female students it is the step test. Therefore, measures to improve students' physical health should be formulated separately for male and female students.
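
To make concrete how a decision tree identifies the most influential factor, the sketch below computes the gain ratio used by the standard C4.5 algorithm and picks the attribute with the highest value as the root split. It shows plain C4.5 only, not the improved variant proposed in this study, and the four records are toy data invented for illustration (they are not the study's data). The function names `entropy`, `gain_ratio`, and `best_attribute` are likewise assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """C4.5 gain ratio: information gain divided by split information."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    # Expected entropy of the target after splitting on attr.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy([r[target] for r in rows]) - conditional
    split_info = entropy([r[attr] for r in rows])
    if split_info == 0:  # attribute takes a single value, so it cannot split
        return 0.0
    return gain / split_info

def best_attribute(rows, attrs, target):
    """Return the attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, a, target))

# Toy, already-discretized records (hypothetical, not from the paper).
rows = [
    {"vital_capacity": "low", "step_test": "high", "grade": "fail"},
    {"vital_capacity": "low", "step_test": "low", "grade": "fail"},
    {"vital_capacity": "high", "step_test": "high", "grade": "pass"},
    {"vital_capacity": "high", "step_test": "low", "grade": "pass"},
]
root = best_attribute(rows, ["vital_capacity", "step_test"], "grade")
```

In this toy set the grade is fully determined by vital capacity, so that attribute is selected as the root. On real data the same comparison would be run separately on the male and female records.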

6 Conclusion

At present, computer networks are widely available in universities. A college sports performance management system built on data mining technology can provide administrators, teachers, and students with sufficient information and quick query methods, complete teacher registration work, and perform statistics, analysis, and processing of data. Moreover, decision trees can be used to mine the key factors that affect sports learning performance. This study analyzes various decision tree algorithms, identifies those suited to the characteristics of the Sun Sports system, studies them in depth, and finds suitable improvements that make the improved algorithms more accurate and efficient. After comparison, the C4.5 algorithm is found to meet the requirements best. This study improves the algorithm in three respects. The first is to improve the discretization method and thereby the efficiency of the algorithm. The second is to add correlation coefficients as parameters so that the selected test attributes are more reasonable. The third is to simplify the improved information entropy formula to further reduce computation and improve efficiency.

References

1. Keating X.D., Research on Preservice Physical Education Teachers’ and Preservice Elementary Teachers’ Physical Education Identities: A Systematic Review[J], Journal of Teaching in Physical Education 36(2) (2017), 1–29.

2. Yli-Piipari, Physical Education Curriculum Reform in Finland[J], Quest Illinois National Association for Physical Education in Higher Education 66(4) (2014), 468–484.

3. Lindberg, Seo and Laine T.H., Enhancing Physical Education with Exergames and Wearable Technology[J], IEEE Transactions on Learning Technologies 9(4) (2016), 12–19.

4. Landi, Fitzpatrick and Mcglashan, Models Based Practices in Physical Education: A Sociocritical Reflection[J], Journal of Teaching in Physical Education 35(4) (2016), 400–411.

5. Ardoy D.N., Fernández-Rodríguez J.M., Jiménez-Pavón, et al., A Physical Education trial improves adolescents’ cognitive performance and academic achievement: the EDUFIT study[J], Scandinavian Journal of Medicine & Science in Sports 24(1) (2014), e52–e61.

6. Erfle S.E. and Gamble, Effects of Daily Physical Education on Physical Fitness and Weight Status in Middle School Adolescents[J], Journal of School Health 85(1) (2015), 27–35.

7. Cheon S.H., Reeve, Yu T.H., et al., The Teacher Benefits From Giving Autonomy Support During Physical Education Instruction[J], Journal of Sport & Exercise Psychology 36(4) (2014), 331–346.

8. Bendiksen, Williams C.A., Hornstrup, et al., Heart rate response and fitness effects of various types of physical education for 8- to 9-year-old schoolchildren[J], European Journal of Sport Science 14(8) (2014), 861–869.

9. Hollis J.L., Williams A.J., Sutherland, et al., A systematic review and meta-analysis of moderate-to-vigorous physical activity levels in elementary school physical education lessons[J], Preventive Medicine 86(1) (2015), 34–54.

10. Wang J.C.K., Morin A.J.S., Ryan R.M., et al., Students’ Motivational Profiles in the Physical Education Context[J], Journal of Sport and Exercise Psychology 38(6) (2016), 612–630.

11. Fletcher and Casey, The Challenges of Models-Based Practice in Physical Education Teacher Education: A Collaborative Self-Study[J], Journal of Teaching in Physical Education 33(3) (2014), 403–421.

12. Oudah, The Nature of personal characteristics of the teaching faculties of Physical Education in the southern region from the view of their students[J], Physica D-nonlinear Phenomena 26(1–3) (2014), 181–192.

13. Aelterman, Vansteenkiste, Lynn V.D.B., et al., Fostering a Need-Supportive Teaching Style: Intervention Effects on Physical Education Teachers’ Beliefs and Teaching Behaviors[J], Journal of Sport & Exercise Psychology 36(6) (2014), 595–609.

14. Goossens, Verrelst, Cardon, et al., Sports injuries in physical education teacher education students[J], Scandinavian Journal of Medicine & Science in Sports 24(4) (2014), 683–691.

15. López Jiménez, Valero-Valenzuela, Anguera M.T., et al., Erratum to: Disruptive behavior among elementary students in physical education[J], Springer Plus 5(1) (2016), 1364–1369.

16. Zhou and Tan, Electrocardiogram soft computing using hybrid deep learning CNN-ELM, Appl Soft Comput 86 (2020).

17. Sun, Li and Shen, Learning in Physical Education: A Self-Determination Theory Perspective[J], Journal of Teaching in Physical Education 36(3) (2017), 277–291.

18. Lewis, Pupils’ and teachers’ experiences of school-based physical education: a qualitative study[J], BMJ Open 4(9) (2014), e005277–e005277.

19. Viciana and Mayorga-Vega, Innovative teaching units applied to physical education - Changing the curriculum management for authentic outcomes[J], Kinesiology 48(1) (2016), 142–152.

20. Scrabis-Fletcher, Rasmussen and Silverman, The Relationship of Practice, Attitude, and Perception of Competence in Middle School Physical Education[J], Journal of Teaching in Physical Education 35(3) (2016), 241–250.

21. Hastie P.A. and Wallhead, Models-Based Practice in Physical Education: The Case for Sport Education[J], Journal of Teaching in Physical Education 35(4) (2016), 390–399.