Abstract
The main purpose of the various methods of athlete feature recognition is to monitor the current condition of athletes and thereby provide feedback on the quality of individual training. Based on deep learning and convolutional neural networks, this paper studies athlete target recognition and proposes a feature vector extraction method based on the curvature zero point. Using the same ideas, it builds an athlete feature recognition model and optimizes the algorithm. To verify the feasibility and efficiency of the proposed feature extraction algorithm and to facilitate comparison with other algorithms, an algorithm performance test is conducted on the sport athlete database. The results show that the proposed method has certain advantages in the feature extraction of athletes and can be used in subsequent sports training systems.
Introduction
With the rapid development of computer technology and the Internet, computers are being assigned more and more tasks, so that they can replace or assist humans in work that is intricate or beyond human capability. Humans obtain information from the outside world in various ways, mainly through sight, hearing, smell, and touch. Among these, vision [1] is the primary sense through which humans understand the world and obtain information. It covers not only the perception of light signals, but also the acquisition, processing, transmission, storage, and understanding of those signals. Therefore, giving computers the ability to see is one of the most urgent and important goals. Computer vision research draws on multiple disciplines, such as machine learning, image processing, pattern recognition, and artificial intelligence. Its main research direction is to process and analyze large amounts of image and video data, so that the computer can obtain information from pictures and make accurate judgments about it [2]. The purpose is to give computers analysis and judgment skills similar to those of humans, so that they can complete some work in place of humans. Human behavior recognition based on computer vision [3] is one of the research directions that has received much attention in recent years. It combines technologies such as computer vision and behavioral pattern analysis and understanding, and its main purpose is to detect, identify, and understand objects of interest in video images of human targets. Owing to its broad research prospects and application fields, it has been studied by more and more researchers in recent years, and many research results have been obtained. These results represent important progress in the application of human behavior recognition.
Human behavior recognition technology combines machine vision with video image processing and analysis to perform in-depth intelligent analysis of the collected image sequence and automatically determine the specific behavior of the human target. The analysis and judgment are based on the detailed information contained in the video image data [4].
Related work
The literature [5] used an RGBD sensor (Microsoft Kinect) as the input sensor and calculated a set of features based on human posture and motion, as well as image and point cloud information. Their algorithm used a hierarchical maximum entropy Markov model. The model regarded a person's activity as consisting of a set of sub-activities and used dynamic programming to infer a two-layer graph structure. The experiments proved the effectiveness of the algorithm. The literature [6] provided an effective capture technique to accurately estimate the position of 3D skeleton joints from a single depth map and to identify the human actions in the depth image sequence by constructing a local popular frame of the data features of the skeleton nodes.
The literature [7] proposed a novel feature extraction method that extracts semi-local features, called random occupancy patterns (ROP), from the depth image. It can effectively search a very large sampling space, and sparse coding was used to encode these features. The literature [8] proposed a feature representation for RGB-D human motion recognition and used ScTPM, a temporal pyramid matching method based on sparse coding. For RGB video, that work adopted the spatio-temporal feature descriptor CS-Mltp, the center-symmetric motion local ternary pattern. For the three-node feature, combined with the ScTPM representation and the CS-Mltp feature descriptor proposed in that work, the color and depth information were fused at the feature layer and the score layer, respectively. The literature [9] combined bone nodes and depth images for motion recognition and used the angle model of bone nodes and the spatiotemporal features of depth images to construct new descriptors. The proposed descriptor is easy to implement, generates a feature set of relatively small size, and has a fast classification scheme, which is suitable for real-time applications. The literature [10] also tried the fusion of RGB images and depth images. First, the depth difference motion history image (DDMHI) was obtained from the depth image, and the same method was used to obtain the full image difference motion history image (WRDMHI) from the RGB image. After that, the motion features were extracted and fused effectively. The literature [11] mapped the depth image to three two-dimensional Cartesian planes and formed a depth motion map (DMM) to describe the action by the difference between the frames outside the threshold range. Then, the histogram of oriented gradients (HOG) was extracted on the DMM. Finally, the classic LIBSVM classifier was used for classification, which achieved very good results.
In recent years, the tremendous increase in computing power has enabled researchers to process visual data with advanced deep learning algorithms at an acceptable speed. Therefore, automatic feature learning methods, namely deep learning, have also been widely used in skeleton motion recognition. Among them, the most widely used structure is based on the recurrent neural network (RNN). Some machine learning methods have also been used for behavior recognition with good results. For example, literature [13] proposed using an improved convolutional neural network to recognize behavior, and literature [14] proposed a behavior recognition method based on a 3D convolutional neural network. The literature [15] proposed an end-to-end hierarchical RNN (HBRNN) structure for skeleton action recognition and divided the body into 5 parts based on the structure of the human body. It achieved better results than the previous manual feature methods on the data set MSRAction3D. The literature [16] proposed the Co-occurrence LSTM. The Co-occurrence LSTM network does not need to group joints deliberately; it automatically selects the joint points that are discriminative for the sequence, and it achieves a result about 10% higher than HBRNN on the SBU Kinect Interaction data set. In addition, the convolutional neural network (CNN) has also been applied to skeleton action recognition. The literature [17] proposed three different action recognition models based on deep learning: Time Selection LSTM, Spatio-Temporal CNN, and Joints Selection CNN. Moreover, it reported a large number of experiments on several public databases for each model and obtained effective results.
Fourier-curvature space descriptor
The literature [19] applied Fourier descriptors to object shape feature extraction, and the literature [20] also applied Fourier descriptors to human body recognition. The Fourier descriptor is the bilateral Fourier transform coefficient under the mapping of the two-dimensional real coordinate system of the human edge curve to the complex coordinate system and is also the frequency domain analysis result of the human edge curve signal. Because the contour curve of the human body edge can be regarded as a periodic function of a two-dimensional plane, it can also satisfy the Fourier transform condition. The Fourier transform coefficient itself does not have rotation, translation, or scale invariance, and it needs to be normalized further.
For any sampling point $s_k(\mu_k, v_k)$ on the edge curve of the human body, the mapping to the complex plane is $s_k = \mu_k + j v_k$. All sampling points $\{s_k,\ k = 1, 2, 3, \cdots, N\}$ mapped on the complex plane are subjected to bilateral Fourier transform [21]:
The one-dimensional vector composed of Fourier coefficients a (f) is the Fourier descriptor. By inverse transforming it and restoring back to the real coordinate system, the edge curve of the original human body contour image can be obtained. Similar to the Fourier transform in the field of signal processing, the low-frequency coefficient of the Fourier transform determines the overall shape of the contour curve of the human body, and the high-frequency coefficient reflects the details of the shape. When retaining the appropriate number of coefficients, the algorithm can achieve good robustness and fully preserve shape details. The length of the Fourier descriptor used in this paper is 100 [22].
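The construction of the descriptor can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the common convention a(f) = (1/N) Σ s_k exp(−j2πfk/N) with k running from 0 to N−1, and the function names `fourier_descriptor` and `reconstruct` are hypothetical. Keeping only the lowest-frequency coefficients preserves the overall shape, and the inverse transform maps the retained coefficients back to contour points.

```python
import numpy as np

def fourier_descriptor(points, n_coeffs=100):
    """Return the n_coeffs lowest-|f| Fourier coefficients of a closed contour.

    points: array of shape (N, 2) holding (mu_k, v_k).
    Assumed convention: a(f) = (1/N) * sum_k s_k * exp(-2j*pi*f*k/N).
    """
    pts = np.asarray(points, dtype=float)
    s = pts[:, 0] + 1j * pts[:, 1]                 # s_k = mu_k + j*v_k
    n = len(s)
    a = np.fft.fft(s) / n                          # bilateral spectrum
    freqs = np.rint(np.fft.fftfreq(n, d=1.0 / n)).astype(int)  # signed frequencies
    order = np.argsort(np.abs(freqs), kind="stable")[:n_coeffs]  # lowest |f| first
    return freqs[order], a[order]

def reconstruct(freqs, coeffs, n_points):
    """Inverse transform of the retained coefficients back to (mu, v) points."""
    k = np.arange(n_points)
    s = np.zeros(n_points, dtype=complex)
    for f, c in zip(freqs, coeffs):
        s += c * np.exp(2j * np.pi * f * k / n_points)
    return np.column_stack([s.real, s.imag])
```

For an ellipse, only the coefficients at f = 0, 1, and −1 are non-zero, so three coefficients already restore the contour exactly; contours with finer detail need more coefficients.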
If the contour curve of the human body is translated in some direction, the coordinate value in the complex plane increases by a constant: $s'_k = \mu_k + \Delta x + j(v_k + \Delta y) = s_k + \Delta_{xy}$, where $\Delta_{xy} = \Delta x + j\Delta y$. According to the nature of the Fourier transform, when the function increases by a constant, only the DC component a (0) of the Fourier transform result changes, by an impulse term [23]:
The translation invariance of the Fourier descriptor can be achieved by discarding a (0). When the body contour rotates, the edge curve rotates by a certain angle θ, and the complex plane coordinate becomes $s_k e^{j\theta}$. It can also be seen from the nature of the Fourier transform that changes in rotation introduce changes in phase [24–26]:
By selecting the amplitude value of the Fourier transform coefficient and ignoring its phase value, the purpose of rotation invariance can be achieved. Similarly, when scale conversion occurs, the scale conversion factor λ is introduced, and the complex plane coordinate becomes $\lambda s_k$. At this time, the Fourier transform is as follows:
That is, each Fourier coefficient is multiplied by the scale conversion factor λ. In order to eliminate the influence of scale transformation on the Fourier descriptor, each Fourier coefficient can be divided by a (1). At this point, the Fourier descriptors for rotation, translation, and scale invariance can be obtained:
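A minimal Python sketch of the three normalization steps follows. It is an illustration under stated assumptions rather than the paper's code: the function name `invariant_fourier_descriptor` is hypothetical, the division uses the magnitude |a(1)|, an even `length` is assumed, and the contour is assumed to be traversed so that a(1) is non-zero.

```python
import numpy as np

def invariant_fourier_descriptor(points, length=100):
    """Translation-, rotation- and scale-invariant Fourier descriptor (sketch)."""
    pts = np.asarray(points, dtype=float)
    s = pts[:, 0] + 1j * pts[:, 1]        # s_k = mu_k + j*v_k
    if len(s) <= length:
        raise ValueError("contour needs more than `length` sampling points")
    a = np.fft.fft(s) / len(s)            # a(f); negative frequencies sit at the end
    mag = np.abs(a)                       # ignoring phase removes the factor e^{j*theta}
    mag = mag / mag[1]                    # dividing by |a(1)| removes the factor lambda
    half = length // 2
    # a(0) is discarded (translation); keep the lowest positive and negative frequencies
    return np.concatenate([mag[1:half + 1], mag[-half:]])
```

Applying a rotation, a uniform scaling, and a translation to the same contour should leave the returned vector unchanged up to floating-point error.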
Curvature is most often calculated from the parametric equation of the curve.
Let the plane curve be Γ and its parametric equation be μ = θ(t), v = ψ(t). The curvature of the curve is defined as:
After introducing Gaussian smoothing, the curve is convolved with one-dimensional Gaussian functions of different scales, that is
From the differential nature of the convolution operation, the curvature formula becomes (* is the convolution operation):
The contour curve of the human body consists of discrete points, so its parametric equations cannot be written explicitly. Therefore, the curvature should be computed directly from the coordinates of neighboring edge points:
Among them,
With the evolution of the scale σ, the curvature of the curve after Gaussian smoothing also changes. The core idea of the curvature scale space is to record the change process of the zero point of curvature to form a characteristic map of the curvature scale space corresponding to the curve. This paper proposes a feature vector extraction method based on the zero point of curvature.
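The idea of counting curvature zero points across scales can be sketched in Python as follows. This is a minimal illustration and not the paper's implementation: it assumes a closed contour sampled at roughly equal spacing, measures σ in units of samples, uses periodic central differences for the derivatives, and the names `smooth_closed`, `curvature`, and `zero_crossing_counts` are hypothetical.

```python
import numpy as np

def smooth_closed(x, sigma):
    """Circular convolution of a periodic signal with a Gaussian (sigma in samples)."""
    n = len(x)
    k = np.arange(n)
    d = np.minimum(k, n - k)                      # circular distance to sample 0
    g = np.exp(-0.5 * (d / sigma) ** 2)
    g /= g.sum()
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(g)))

def curvature(points, sigma):
    """Curvature of the Gaussian-smoothed closed contour at every sample."""
    pts = np.asarray(points, dtype=float)
    mu = smooth_closed(pts[:, 0], sigma)
    v = smooth_closed(pts[:, 1], sigma)
    d1 = lambda f: (np.roll(f, -1) - np.roll(f, 1)) / 2.0       # first derivative
    d2 = lambda f: np.roll(f, -1) - 2.0 * f + np.roll(f, 1)     # second derivative
    mu1, v1, mu2, v2 = d1(mu), d1(v), d2(mu), d2(v)
    return (mu1 * v2 - v1 * mu2) / (mu1 ** 2 + v1 ** 2) ** 1.5

def zero_crossing_counts(points, sigmas):
    """Number of curvature sign changes around the contour for each scale."""
    counts = []
    for sigma in sigmas:
        sign = np.sign(curvature(points, sigma))
        sign = sign[sign != 0]                    # ignore exact zeros
        counts.append(int(np.sum(sign != np.roll(sign, 1))))
    return np.array(counts)
```

A feature vector of length 9 with a maximum scale of 5 could then be formed as `zero_crossing_counts(contour, np.linspace(1.0, 5.0, 9))`, where the starting scale of 1.0 is an assumption made only for this example.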
Each sub-picture in Fig. 1 shows the curvature zero points of the human body contour curve after smoothing with a different Gaussian scale. The “+” sign in the sub-picture marks a curvature zero point, that is, a point at which the curvature is zero. It can be seen from the figure that as the Gaussian scale increases, the number of curvature zero points of the smoothed curve decreases. For the given human body contour example, when σ = 5, the number of curvature zero points is zero. In addition, the point marked by the rectangular frame in the curvature scale space feature map is the maximum value reference point used in the traditional recognition algorithm, in which image recognition is completed according to the position k and relative distance of the reference point.

Schematic diagram of the curvature scale space features of the human body edge curve.
The number of curvature zero points at different Gaussian scales is extracted from the histogram in Fig. 1 to form the curvature scale space feature vector of the human body contour image, which is used for human body contour recognition. It can be seen from the table that the feature vector based on the zero point of curvature proposed in this paper is effective in the human body contour recognition task, and that the recognition result does not depend on the length of the feature vector. That is, increasing the degree of smoothing (increasing the value of σ) at a fixed step size does not affect the final recognition result. Therefore, σmax = 5, len = 9 can be selected for the feature vector according to the data in the table. In addition, fixing the maximum value of σ and reducing the step size has no obvious effect on the recognition results.
Since image retrieval does not need to strictly output its category, the retrieval rate is often higher than the recognition accuracy rate. According to the citation data, it can be inferred that when the retrieval degree tends to be recognized, the retrieval rate is far below 39.29%, which also shows that the method proposed in this section is reasonable.
Gradient as a vector parameter plays an important role in the numerical analysis of multivariate nonlinear functions, and gradient is also the basis of related algorithms of fully connected neural networks. Therefore, it is necessary to introduce the gradient and its application in neural networks.
We assume that a certain surface z = f (x, y) in three-dimensional space is defined in a certain neighborhood of p0 (x0, y0). The angle between the ray l starting from p0 and the positive direction of the x axis is φ. The direction of l is used as the reference direction and is denoted as
exists, it is the directional derivative of the function z = f (x, y) at point p0 (x0, y0) along the direction
If the function z = f (x, y) is differentiable in the neighborhood of p0 (x0, y0), that is, the two sides of the function increment:
are divided by t and the limit is taken, which gives the following result:
The limit form is written as a vector form:
ɛ is an important parameter in machine learning: the learning rate (step size). In practical applications, the main role of the gradient vector can be regarded as indicating the direction in which the function decreases, while the parameter that really determines the next value is the learning rate. Therefore, each iteration of the gradient descent algorithm is greatly affected by the learning rate. If the learning rate is not set properly, the cost function will have difficulty converging.
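The influence of the learning rate can be shown with a small Python sketch. It assumes the usual update x ← x − ɛ∇f(x) and uses a hypothetical test function f(x, y) = x² + 10y²; neither the function nor the specific rates come from the paper.

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent: x <- x - lr * grad(x), repeated `steps` times."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Gradient of the illustrative function f(x, y) = x^2 + 10*y^2
grad = lambda p: [2.0 * p[0], 20.0 * p[1]]

good = gradient_descent(grad, [1.0, 1.0], lr=0.05, steps=200)  # converges toward (0, 0)
bad = gradient_descent(grad, [1.0, 1.0], lr=0.11, steps=200)   # y-coordinate diverges
```

With ɛ = 0.11 the factor applied to y in each step is 1 − 20ɛ = −1.2, so its magnitude grows at every iteration, whereas ɛ = 0.05 keeps both factors below 1 in magnitude.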
Considering that the gradient direction has specified the search direction, the original multivariate function optimization problem is transformed into a line search problem. That is, with the learning rate ɛ as an independent variable, the original target multivariate function
Since
Equation (15) constrains the gradient at the search point: the gradient at the search point should be greater than or equal to σ times the initial gradient. By the nature of gradient descent, the gradient at the selected search point should not be greater than the gradient at the initial point, so with σ < 1 the above formula holds and gives a lower limit for ɛ_k. A learning rate satisfying Equations (14) and (15) can be taken as the learning rate for that iteration. By applying the Wolfe-Powell criterion to the neural network, adaptive adjustment of the learning rate ɛ can be achieved.
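A minimal Python sketch of a line search under the Wolfe-Powell criterion is given below. It assumes that the two conditions take their standard form, a sufficient decrease condition with a constant written here as `rho` (a name introduced only for this example) and the gradient condition with σ, and it uses simple interval halving; the paper's own search procedure may differ.

```python
def satisfies_wolfe(phi, dphi, eps, rho=1e-4, sigma=0.9):
    """Check both conditions for phi(eps) = f(x + eps*d), with dphi(0) < 0."""
    sufficient_decrease = phi(eps) <= phi(0.0) + rho * eps * dphi(0.0)
    gradient_condition = dphi(eps) >= sigma * dphi(0.0)
    return sufficient_decrease and gradient_condition

def wolfe_line_search(phi, dphi, eps=1.0, rho=1e-4, sigma=0.9, max_iter=50):
    """Shrink or grow eps until both conditions hold (bisection sketch)."""
    lo, hi = 0.0, float("inf")
    for _ in range(max_iter):
        if phi(eps) > phi(0.0) + rho * eps * dphi(0.0):
            hi = eps                      # step too long: not enough decrease
        elif dphi(eps) < sigma * dphi(0.0):
            lo = eps                      # step too short: slope still too steep
        else:
            return eps
        eps = 0.5 * (lo + hi) if hi < float("inf") else 2.0 * eps
    return eps                            # best effort after max_iter trials
```

For f(x) = 2x² at x = 1 with direction d = −f′(1) = −4, the one-dimensional function is phi(ɛ) = 2(1 − 4ɛ)², and the search starting from ɛ = 1 settles on ɛ = 0.25.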
We assume that an input 264-dimensional feature vector is
The hypothesis function after applying the logistic function constraint is:
One of the important reasons for choosing the logistic function is the form of its derivative. As noted above, the gradient descent method depends on the partial derivatives of the objective function, and g (z) is differentiable everywhere in its domain with a simple derivative: g′ (z) = g (z) (1 - g (z)). The derivative of g (z) can therefore be computed entirely from values already obtained in the forward propagation of the neural network, which helps to improve the calculation efficiency.
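The derivative identity can be checked numerically with a short Python sketch, assuming g is the standard logistic function g(z) = 1 / (1 + e^(−z)). The function names are chosen for this example only.

```python
import math

def g(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def g_prime(z):
    """Derivative computed from the forward value only: g(z) * (1 - g(z))."""
    a = g(z)
    return a * (1.0 - a)
```

Because `g_prime` reuses the activation `a`, no additional exponential has to be evaluated during back propagation.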
The cost function is the objective optimization function in machine learning, and it is also the application object of the gradient descent algorithm. The common cost function idea is to calculate the Euclidean distance or square error function. If the hypothetical model
Since the calculation result of
The above formula is derived from a single input. Since the neural network needs to put the training set into the network during the training process, the above formula is changed to:
In the formula, i is the i-th sample in the training set, and $y_i$ is its corresponding theoretical output value. It can be proved that the new cost function is a convex function, which can be converged to the global optimal value by the gradient descent algorithm. The partial differential formula of
In order to solve the problem of overfitting, we introduce a regularization method, which can effectively reduce the magnitude of the parameters while retaining all the features. The idea of regularization is as follows: when optimizing the cost function
$\theta_j$ is the jth element in
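A regularized cost of this kind can be sketched in Python as follows. The sketch assumes the cross-entropy form of the convex cost, an L2 penalty weighted by λ/(2m), and an unpenalized bias term θ_0; these conventions and the name `regularized_cost` are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Cross-entropy cost with L2 penalty and its gradient.

    X: (m, n) design matrix whose first column is all ones (bias).
    y: (m,) vector of 0/1 targets. lam: regularization strength.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    penalty = lam / (2.0 * m) * np.sum(theta[1:] ** 2)   # bias theta_0 not penalized
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]
    return data_term + penalty, grad
```

A larger `lam` pulls the non-bias parameters toward zero, which is the sense in which regularization shrinks the parameters while all features are retained.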
In a multi-classifier constructed by a fully connected neural network, each input will have k output values and a weight matrix Θ formed by multiple weight vectors
The back propagation algorithm of formula (22) is now given in matrix form. If it is assumed that the fully connected neural network has a total of 4 layers of neurons, then by formula (20), the partial differential of the output value of the cost function of the entire neural network to the input of the output layer is written as:
Among them, out4 represents the output of the output layer, that is, the logistic function applied to the input of the output layer. It can be further obtained that the partial derivative of the cost function output with respect to the input of the third layer is:
In the formula, [out3 · (1 - out3)] is the derivative term of the aforementioned logistic function. Similarly, the partial derivative of the cost function output with respect to the input of the second layer is:
The above formulas give the partial differential of the output of the neural network to the input of each layer. From the transfer properties of partial differentials, the partial differential form of the output of the neural network cost function to an element in the weight matrix of the layer 1 neural network is:
After formula (26) is applied to the entire neural network, the partial differentiation of the cost function output for each weight value can be obtained. The ultimate goal of the neural network solution is to find the θ value that makes J (Θ) the minimum value according to the gradient descent algorithm.
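The layer-by-layer computation can be sketched in Python for a 4-layer fully connected network. The sketch is a simplified illustration: it assumes a cross-entropy cost for a single sample, logistic activations in every layer, and no bias units, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Return [out1, out2, out3, out4], where out1 is the input vector."""
    outs = [x]
    for theta in thetas:
        outs.append(sigmoid(theta @ outs[-1]))
    return outs

def cost(thetas, x, y):
    h = forward(thetas, x)[-1]
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def backprop(thetas, x, y):
    """Gradients of the cost with respect to the three weight matrices."""
    out1, out2, out3, out4 = forward(thetas, x)
    delta4 = out4 - y                                      # w.r.t. input of layer 4
    delta3 = (thetas[2].T @ delta4) * out3 * (1.0 - out3)  # w.r.t. input of layer 3
    delta2 = (thetas[1].T @ delta3) * out2 * (1.0 - out2)  # w.r.t. input of layer 2
    return [np.outer(delta2, out1),
            np.outer(delta3, out2),
            np.outer(delta4, out3)]
```

Each returned matrix has the same shape as the corresponding weight matrix, so a gradient descent step subtracts the learning rate times these matrices from the weights.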
In most convex optimization problems, there is no requirement on the selection of initial parameters. However, for multilayer neural networks, the selection of initial parameters is constrained. If the initial parameters are all zero, all activation units of the second layer of the neural network have the same value. Similarly, if the initial parameters are all set to the same non-zero value, the result is the same. In both situations the units within a layer remain identical, which leaves the entire neural network in a redundant state.
Generally, random parameters that satisfy a uniform distribution are selected.
rand(0, 1) denotes the uniform distribution on [0, 1], so the above formula is equivalent to a uniform distribution on [-eps, eps]. eps can be set to a small value; in this article, eps = 0.12.
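A minimal Python sketch of this initialization is shown below. It assumes the mapping rand(0, 1) × 2·eps − eps, which matches the stated interval, and the function name `init_weights` and the layer sizes in the usage note are hypothetical.

```python
import numpy as np

def init_weights(n_out, n_in, eps=0.12, rng=None):
    """Weight matrix with entries drawn uniformly from [-eps, eps)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.random((n_out, n_in)) * 2.0 * eps - eps
```

For example, `init_weights(25, 264)` would give a 25 × 264 matrix for a layer fed by a 264-dimensional feature vector; the hidden size of 25 is chosen only for illustration.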
In order to verify the feasibility and efficiency of the sport athletes feature extraction algorithm proposed in this paper and to facilitate comparison with other algorithms, this paper conducts an algorithm performance test on the sport athletes database.
In this section of the experiment, 40 categories were randomly selected to form the experimental database, with 12 images per category, for a total of 480 images. In each category, 6 pictures are selected for training and 6 pictures for testing. The experiment is conducted on the Windows 10 platform with 12 GB of memory and an Intel Core i5-4200U CPU. The software environment is PyCharm 2018.3.5 and MATLAB R2018a, and the relevant code is written in Python3 and MATLAB.
Experiment 1: The first experiment verifies that feature extraction on sports ROI images after texture enhancement and fusion performs better than feature extraction directly on the original sports images.
In this experiment, a total of 5 tests were conducted, and the average of the 5 results was taken as the final result. With Gabor filter bank scales of 7×7, 9×9, 11×11, and 13×13, the accuracy and feature extraction time were measured for the original sports image, the texture-enhanced image, and the fusion of the two. The results are shown in Table 1 and Figs. 2 and 3.
Accuracy and feature extraction time of different image sets

The accuracy of different image sets.

The feature extraction time of different image sets.
The accuracy with the texture-enhanced image is 0.1 percentage points higher than with the original image, which is not a big improvement. Its feature extraction takes 0.5 s longer than that of the original athlete image, and the extra time is spent on texture enhancement. The feature recognition accuracy of the fused athlete image is 1.4 percentage points higher than that of the original image and 1.3 percentage points higher than that of the texture-enhanced image. The time from texture enhancement through image fusion to feature extraction is about 0.6 s longer than for the original image, and only about 0.1 s longer than for the texture-enhanced image.
In summary, performing recognition after fusing the original image with the texture-enhanced sports image is feasible, and the effect is better than using the original image alone for feature recognition. This method helps improve the reliability of sports feature recognition, and feature extraction is not time-consuming: it averages 0.05 s per picture, which is acceptable in practice. To a certain extent, the results demonstrate the effectiveness and feasibility of this feature extraction algorithm based on a neural network structure.
Experiment 2: In order to further demonstrate the effectiveness and feasibility of the sport athletes feature extraction method in this paper, experiment 2 uses the same sport athletes ROI image data set and SVM classifier as experiment 1 and runs 5 comparison experiments against the recognition performance of the LBP feature extraction algorithm and the PCA dimensionality reduction algorithm. The evaluation index is unchanged, and the average value is taken. The results are shown in Table 2 and Fig. 4. It can be seen from Table 2 that the accuracy of the sport athletes feature extraction method in this paper is several percentage points higher than that of the LBP and PCA algorithms. Since the feature processing stage requires multiple filtering and sampling operations at multiple scales, the feature extraction time is longer than for LBP and PCA. However, in terms of accuracy, the sport athletes feature extraction algorithm in this paper is more reliable than the LBP and PCA feature extraction algorithms.
Test results of different feature extraction algorithms

Test statistical diagram of different feature extraction algorithms.
The correct rate of vein recognition of sportsmen fusion images under the same number of scales
Experiment 3: The number of scales in the Gabor function filter bank affects the reliability of the features and the overall efficiency of the algorithm to some extent. In order to select a suitable filter bank, the accuracy and efficiency of the algorithm in this chapter are tested under different combinations of Gabor function groups and different scales. The scale sizes are 7×7, 9×9, 11×11, 13×13, 15×15, 17×17, 19×19, and 21×21, and the number of scales used increases in order: 2, 4, 6, 8. This experiment is carried out on the fused images of sport athletes. In order to reduce the experimental error, the experiment is run 5 times, and the averages of the accuracy and of the feature extraction time on the fused ROI images are used as the final test results. The results are shown in Table 3, Table 4, Fig. 5, and Fig. 6.
The time-consuming of vein extraction of sportsmen fusion images under different scale numbers

The correct rate of vein recognition of sportsmen fusion images under the same number of scales.

Statistical diagram of the time-consuming of vein extraction of sportsmen fusion images under different scale numbers.
As can be seen from Table 3 and Fig. 5, the algorithm proposed in this paper can effectively complete the sport athletes feature recognition task, and the recognition rates under the different filter scale combinations are all above 94%. The algorithm performance test on the four sets of scale numbers shows that the recognition rate first increases and then decreases, as shown in Fig. 5. When the Gabor function group in hidden layer 2 of the neural network proposed in this chapter uses 4 Gabor scales (7×7, 9×9, 11×11, 13×13), the recognition rate is the highest.
As can be seen from Table 4 and Fig. 6, as the number of scales increases, the time for feature extraction almost doubles. The reason is that as the number of kernels of different scales in the Gabor function filter bank increases, more filtering and sampling operations are required, more feature maps are generated during extraction, and the final feature vector carries more detailed information. However, since the resolution of the sport athletes images is not very high, filtering at a suitable number of scales is enough to extract texture features effectively. If the number of scales is too large, some details may be ignored as the scale size increases, and a certain degree of overfitting may result. Therefore, in Fig. 5 the accuracy first increases and then decreases. Although feature extraction with 4 scales takes more than 2.5 times as long as with 2 scales, accuracy matters more than feature extraction time in practice, so this paper selects 4 different Gabor scales to form the filter bank.
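The relationship between the number of scales and the amount of filtering can be illustrated with a minimal Python sketch of a Gabor filter bank. The paper does not give its kernel parameters, so the wavelength, σ, aspect ratio γ, and the use of 4 orientations per scale are assumptions made only for this example, as are the function names.

```python
import numpy as np

def gabor_kernel(size, theta, wavelength=None, sigma=None, gamma=0.5, psi=0.0):
    """Real Gabor kernel of shape (size, size); size is assumed to be odd."""
    wavelength = size / 2.0 if wavelength is None else wavelength  # assumed default
    sigma = size / 6.0 if sigma is None else sigma                 # assumed default
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)       # coordinates rotated by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength + psi)

def gabor_bank(sizes=(7, 9, 11, 13), n_orientations=4):
    """One kernel per (scale, orientation) pair."""
    thetas = [np.pi * i / n_orientations for i in range(n_orientations)]
    return [gabor_kernel(s, t) for s in sizes for t in thetas]
```

Under these assumptions, the bank for 4 scales contains twice as many kernels as the bank for 2 scales, so each image requires twice as many filtering passes, and the larger kernels cost more per pass.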
In the field of sports recognition based on machine vision, the target recognition algorithms used in auxiliary training equipment have high time complexity, demanding hardware requirements, complicated operation, and expensive equipment. Moreover, such equipment lacks an automatic recognition function and requires manual operation, which reduces training efficiency. In response to this problem, this paper has studied target recognition algorithms based on machine learning in depth. After experimenting with commonly used target detection methods and comparing and analyzing the results, this paper proposes, for sports scenes, a moving target detection method based on deep learning and convolutional neural networks. The algorithm obtains good results in the application scenario of this paper, is robust to interference and noise, and effectively handles complex background changes, which paves the way for the subsequent feature extraction. The comparative experiments indicate that the average time taken by this method to complete an epoch is lower than that of a convolutional neural network. With only a small difference in accuracy, the network is smaller and fewer parameters need to be calculated. Moreover, with higher computing power, the method in this paper approaches real-time processing. Similarly, in embedded systems with lower computing power, the method in this paper has better computing performance than convolutional neural networks.
