Abstract
Features extracted from sports competitions play an important role in judging the fairness of a game and in improving athletes' skills. At present, feature recognition in sports competition is degraded by the environmental background. To improve recognition, this study improves the TLD (Tracking-Learning-Detection) algorithm and uses machine learning to build a feature recognition model for sports competition based on the improved algorithm. The improved algorithm is also applied to long-term pedestrian tracking with PTZ cameras. It is analyzed and verified experimentally on standard data sets, and the experimental results are presented visually using statistical methods. The results indicate that the proposed method is effective.
Introduction
Athlete monitoring and tracking is a branch of video tracking. Intelligent feature recognition is now widely used in intelligent security, intelligent transportation, and military applications. An athlete is a flexible target whose movement is random, which makes tracking more complicated [1]. Traditional long-term tracking takes two forms: target tracking and target detection. Surveillance is the traditional application field of pedestrian tracking. A traditional tracking algorithm is relatively simple: in general, it only needs to be initialized and then estimates the target's movement, and the tracking process involves little computation, so real-time performance is easy to achieve. However, it has a major flaw: when the target leaves the window and later returns, the tracker loses it [2]. Object detection was not originally used for tracking because it is computationally intensive and requires careful offline learning. After decades of development, available computing resources can largely meet the demands of detection, so detection is now often used in target tracking. When detection is used for tracking, every frame must be processed. The disadvantage is that detection requires a large amount of offline training data, and the detector must know the characteristics of the target in advance, that is, either the target is modeled or samples of it are collected beforehand. Human tracking generally performs target detection first and target tracking second. General tracking algorithms work well over short periods. Over longer periods, however, the appearance of the target changes considerably through deformation, lighting, occlusion, and changes in size.
Therefore, long-term tracking performance is generally poor. Structurally, human detection and tracking algorithms generally consist of a detection module and a tracking module, and often an offline learning module as well. The disadvantage of offline learning is that it generally requires a large number of training samples; if the training samples are incomplete, the tracked target will be lost. Tracking algorithms have matured over the decades, and many useful ones exist. TLD is a tracking algorithm that has attracted attention in recent years. It handles the recapture of targets lost through occlusion well and has been widely used in various fields [3].
Athlete tracking is one research direction within video tracking and is a form of flexible target tracking. Flexible targets deform easily during movement. In target tracking, flexible targets have long been regarded as among the hardest problems and attract considerable research effort. Pedestrian tracking has also been a hot topic in recent years. It requires studying human motion, including separating targets from the background and tracking them continuously over long periods [4], and at an advanced stage it includes understanding human behavior. By analyzing the trajectory of the human body, pedestrian tracking can predict pedestrian movement so that a response to human behavior can be prepared in advance. Because human tracking belongs to video tracking, algorithms commonly used in video tracking are also effective for pedestrian tracking. However, because of the particular nature of the human body, more suitable algorithms still need to be studied. Many algorithms also maintain tracking by running target detection on every frame.
Related work
Object detection extracts the object of interest from the video using the motion foreground, that is, it determines the size and position of the object at the current moment. Methods based on background modeling rely on historical background information: they build a background model by learning the background and then compare the model with the current frame to segment the foreground. Methods based on target modeling train a classifier on a large number of training samples and then apply the classifier in a multi-scale sliding-window traversal of the current frame to find the target.
Since the 1960s and 1970s, background modeling has attracted wide attention from researchers at home and abroad. Ref. [5] proposed a simple background subtraction method that uses inter-frame differencing for foreground detection in fixed-scene video. However, this method is not applicable when the background changes or the moving target is large. To adapt to background changes, background modeling methods based on statistical learning have developed rapidly in recent years. Ref. [6] proposed a single-Gaussian background modeling method, and Ref. [7] proposed a Gaussian mixture modeling method. Background modeling based on the Gaussian mixture model takes a single pixel as the basic processing unit and represents the color distribution of that pixel as a weighted sum of several Gaussian distributions, which significantly improves object detection capability. Many scholars have proposed improvements on the Gaussian mixture model. Ref. [8] proposed an improved parameter update algorithm for Gaussian mixture models based on detecting the shadows of moving targets. Ref. [9] proposed a non-parametric algorithm that uses kernel density estimation to analyze the historical information of pixels and extracts the foreground target by maximizing the posterior probability. The Bayesian model proposed in Ref. [10] separately estimates the probability density distributions of the background pixels and the foreground pixels in the same group and separates background from foreground by Bayesian decision based on the difference between the estimates. The codebook method proposed in Ref. [11] makes full use of temporal context information.
Based on the video sequence, it builds a time-series model of each pixel using clustering statistics and distinguishes foreground from background by comparing the current pixel with the established codebook. Considering the connection between adjacent pixels, many researchers use features of local image areas, such as regional edges and textures, for background modeling. Ref. [12] proposed extracting features of local sub-regions of the image frame, computing the correlation between sub-regions with the normalized vector distance, and separating the foreground by matching. Ref. [13] proposed a background modeling algorithm based on LBP texture features, which can effectively deal with lighting changes in the environment, background disturbance, and other problems. This algorithm creates multiple LBP histograms with different parameters for each pixel in the image area and updates them dynamically. In addition, Ref. [14] proposed background modeling methods that operate on whole video frames, such as a background subtraction algorithm based on compressed sensing and a background restoration algorithm using convex optimization. Given prior knowledge of the background, these methods are robust to background changes, and they have attracted research attention in recent years. Methods based on object modeling are mainly suited to detecting objects with relatively consistent outer contours, such as pedestrians, cars, and faces. Compared with background modeling, this approach developed later. It made a breakthrough only when Ref. [15] proposed a face detection method based on a cascade structure.
The detection framework provided by this method achieves real-time detection of frontal faces; it uses Haar-like features to describe faces and a hierarchical (cascade) structure to efficiently reject a large number of background windows. In Ref. [16], a detection method using the Histogram of Oriented Gradients (HOG) feature and an SVM classifier also achieved good results in pedestrian detection. With the boom in pedestrian detection, researchers at home and abroad have proposed more detection algorithms in recent years. In Ref. [17], applying the HOG feature to variable-size image regions together with an SVM classifier significantly improved detection speed in hierarchical detection. In Ref. [18], a hierarchical training algorithm combining HOG, Edgelet features, and covariance descriptors was proposed; each stage of the hierarchical classifier corresponds to a local area of the window, and the detection effect is better. Ref. [19] combined LBP histogram features with HOG features to improve the feature description. In complex scenes, pedestrians are more likely to be occluded. To handle target occlusion and changes in target pose, object modeling detection based on component models came into being. Ref. [20] proposed a constellation model that considers the relative positions of the components as well as their appearance and scale. The deformable component model proposed in the literature includes a global model, a component model, and a deformation description model, and it handles pedestrian occlusion and pose changes better. In recent years, learning algorithms commonly used for detection based on target modeling include neural networks, SVM, AdaBoost, decision trees, and random forests.
With the development of deep learning, whose models have strong representational power, target detection based on deep learning models is expected to achieve better detection results.
TLD algorithm based on HOG-SVM detection
The histogram of oriented gradients (HOG) feature selected in this paper is an image descriptor designed for human detection. It describes the human body with gradient histograms and uses the collected motion and shape information to form a rich feature set. The HOG feature describes the edges of a pedestrian and is derived from the scale-invariant feature transform (SIFT). To extract HOG features, the image is first divided into small connected regions called cell units. Then, the gradient or edge directions of the pixels in each cell unit are collected into a histogram, and the histograms are combined to form the feature descriptor.
The HOG feature dimension of the target depends on the sample size: the larger the sample, the higher the feature dimension. Fig. 1 shows the visualized gradient image of the HOG feature of a detection sample of size 720 × 960 and a schematic diagram of the sample feature in the field of view of the detection module. The visualized gradient image intuitively reflects the richness of the feature information of the collected samples. The detection module view is a schematic diagram of the observed samples from the perspective of the HOG detection module.

Fig. 1. HOG feature detection.
Let I denote an image and I(x, y) the gray value of the image at pixel (x, y). The detection window for feature extraction is set to 64 × 128 pixels, and the cell size is 8 pixels.
Figure 2 is a flowchart of HOG feature extraction.

Fig. 2. HOG feature extraction process.
(1) First, color and gamma normalization are performed on the image in gray space, RGB color space, and LAB color space. Because the target human body appears under varying conditions, normalization improves the robustness of the detection module to changes in illumination.
(2) Then, the gradient value of each point in the image is calculated using the gradient template [-1, 0, 1]:
$$G_x(x, y) = I(x+1, y) - I(x-1, y), \qquad G_y(x, y) = I(x, y+1) - I(x, y-1) \tag{1}$$
In formula (1), x and y are the horizontal and vertical coordinates of a pixel in the image, and G_x(x, y) and G_y(x, y) are the gradient values at that point in the x and y directions, respectively.
(3) Then, the gradient intensity M(x, y) and gradient direction θ(x, y) of the pixel are calculated as follows:
$$M(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}, \qquad \theta(x, y) = \arctan\frac{G_y(x, y)}{G_x(x, y)}$$
To improve the noise resistance of the HOG feature, the gradient direction is unsigned, and θ(x, y) is limited to the range [0, π].
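Steps (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the image is a plain list of rows indexed as img[y][x], and border pixels are handled by clamping, which is an assumption because the text does not say how borders are treated.

```python
import math

def gradients(img, x, y):
    """Central differences with the template [-1, 0, 1]; img is indexed img[y][x]."""
    h, w = len(img), len(img[0])
    # Assumption: neighbours outside the image are clamped to the border pixel.
    xl, xr = max(x - 1, 0), min(x + 1, w - 1)
    yu, yd = max(y - 1, 0), min(y + 1, h - 1)
    gx = img[y][xr] - img[y][xl]
    gy = img[yd][x] - img[yu][x]
    return gx, gy

def magnitude_and_direction(gx, gy):
    """Gradient intensity and unsigned direction folded into [0, pi)."""
    m = math.hypot(gx, gy)
    theta = math.atan2(gy, gx) % math.pi
    return m, theta

# A horizontal intensity ramp: the gradient points along x.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
gx, gy = gradients(ramp, 2, 2)
m, theta = magnitude_and_direction(gx, gy)
```

Folding the direction with a modulo makes a gradient and its opposite fall in the same orientation, which is what the unsigned range is meant to achieve.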
(4) The range [0, π] is evenly divided into 9 intervals (bins), each π/9 (20°) wide.
Trilinear interpolation is used to distribute the vote of each gradient direction among adjacent bins in proportion to its distance from them, with corresponding weights. Rectangular R-HOG blocks are then used to group cell units into a new feature vector.
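The orientation part of this voting scheme can be illustrated as follows. The sketch is a simplification: it interpolates only between the two nearest orientation bins and ignores the spatial interpolation between neighbouring cells, and the function name and the choice of bin centres at (i + 0.5)·π/9 are assumptions made for the example.

```python
import math

def cell_histogram(magnitudes, angles, n_bins=9):
    """Accumulate magnitude-weighted votes into n_bins unsigned orientation bins.

    Each vote is split linearly between the two bins whose centres are nearest
    to the angle; bins wrap around because 0 and pi are the same orientation.
    """
    width = math.pi / n_bins
    hist = [0.0] * n_bins
    for m, a in zip(magnitudes, angles):
        pos = a / width - 0.5          # position measured in bins from the centre of bin 0
        lo = math.floor(pos)
        frac = pos - lo                # share of the vote that goes to the upper bin
        hist[lo % n_bins] += m * (1.0 - frac)
        hist[(lo + 1) % n_bins] += m * frac
    return hist

# An angle on the boundary between bins 0 and 1 is shared equally by both.
hist = cell_histogram([1.0], [math.pi / 9])
```

Because the two weights always sum to one, the total of the histogram equals the total gradient magnitude in the cell.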
(5) L1-norm normalization is then applied to reduce the effects of lighting and noise, and the features of each block are collected.
(6) Finally, all the feature vectors in the detection window are concatenated, and the final target feature vector is expressed as:
In practice, several interpolation steps are involved in forming the high-dimensional feature vector.
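To give a sense of the resulting dimensionality, the sketch below counts descriptor entries for the 64 × 128 window, 8-pixel cells, and 9 bins stated above. The block size (2 × 2 cells) and block stride (one cell) are not given in the text; they are assumed here because they are the common defaults, and the helper names are hypothetical.

```python
def hog_descriptor_length(win_w, win_h, cell, block_cells=2, stride_cells=1, bins=9):
    """Number of entries in a HOG descriptor built from overlapping blocks."""
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks_x = (cells_x - block_cells) // stride_cells + 1
    blocks_y = (cells_y - block_cells) // stride_cells + 1
    return blocks_x * blocks_y * block_cells * block_cells * bins

def l1_normalize(block, eps=1e-6):
    """L1 normalization of one block vector; eps avoids division by zero."""
    total = sum(abs(v) for v in block) + eps
    return [v / total for v in block]

length = hog_descriptor_length(64, 128, 8)   # 7 x 15 blocks x 4 cells x 9 bins
```

Under these assumptions the window contains 7 × 15 = 105 blocks of 36 values each, which gives 3780 entries.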
The SVM classification process is a machine learning process, and the classification criteria originate from linear regression. The goal of SVM is to find a hyperplane w^T x + b = 0 in the N-dimensional space of the sample points that accurately separates the different types of data points.
Fig. 3 shows the classification of linearly separable two-dimensional data on a plane. The circles and boxes represent different types of data points, the solid line in the middle is the classification hyperplane, and the categories of the data points on its two sides are +1 and −1, respectively. The distance between the nearest data points and the classification hyperplane is Gap/2.

Fig. 3. Schematic diagram of SVM classification of linearly separable data.
When using SVM to classify data, the classification function of the hyperplane is:
The mapping that determines the category of a new data point (x_i, y_i) is:
The objective function of the SVM model is:
The above objective function is equivalent to:
Considering the noise in the actual data, some outliers will appear. The objective function of equation (7) is transformed into:
In formula (8), C is a constant parameter that controls the relative weight of the two terms of the objective function, and ɛ is the variable to be optimized. Since this optimization problem is a convex quadratic programming problem, it can be transformed into its equivalent dual problem by attaching a Lagrangian multiplier α to the constraints, which makes it easier to solve. Therefore, we set α_i as the Lagrangian multiplier for each constraint and define the Lagrangian function as follows:
In the formula (11), f (x) is a function to be minimized, g (x) is an inequality constraint, h (x) is an equality constraint, and p and q represent the number of inequality constraints and equality constraints, respectively.
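For orientation, the standard textbook forms of the quantities discussed here are collected below. They are assumed, not confirmed, to correspond to the paper's formulas. The slack variables are written ɛ_i and the multipliers α_i and r_i as in the text, while β_j for the equality-constraint multipliers and the sample count n are symbols introduced only for this sketch.

```latex
% Soft-margin primal problem (standard form, assumed)
\min_{w,\,b,\,\varepsilon}\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\varepsilon_i
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1 - \varepsilon_i,\qquad \varepsilon_i \ge 0

% Its Lagrangian, with multipliers \alpha_i \ge 0 and r_i \ge 0
L(w, b, \varepsilon, \alpha, r) = \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\varepsilon_i
 - \sum_{i=1}^{n}\alpha_i\bigl[\,y_i\,(w^{T}x_i + b) - 1 + \varepsilon_i\,\bigr]
 - \sum_{i=1}^{n} r_i\,\varepsilon_i

% Generalized Lagrangian for p inequality constraints g_i(x) \le 0
% and q equality constraints h_j(x) = 0
L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{p}\alpha_i\,g_i(x) + \sum_{j=1}^{q}\beta_j\,h_j(x)
```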
For the dual problem, the optimal solution α* is obtained by first minimizing L(w, b, ɛ, α, r) with respect to w, b, and ɛ and then maximizing with respect to α using the SMO algorithm.
Therefore, the final optimal classification function is:
It can be seen that the SVM classification method converts a problem that is not linearly separable into a linearly separable one by mapping samples from an N-dimensional space to a higher-dimensional space. In addition, the SVM classification model uses kernel functions to map the feature space implicitly, which avoids direct calculation in the high-dimensional space and simplifies the calculation process.
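The implicit mapping can be checked on a small case. The sketch below uses a degree-2 polynomial kernel, chosen only because its feature map is finite and can be written out explicitly; it is not the kernel used by the paper, and the function names are hypothetical.

```python
import math

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D map phi for 2-D inputs such that phi(x) . phi(z) = (x . z)^2."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = [1.0, 2.0], [3.0, -1.0]
implicit = poly_kernel(x, z)                       # computed in the original 2-D space
explicit = dot(feature_map(x), feature_map(z))     # computed in the mapped 3-D space
```

Both routes give the same inner product, but the kernel never constructs the higher-dimensional vectors, which is the saving the paragraph above refers to.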
The KCF (kernelized correlation filter) algorithm builds a large training set by constructing a circulant matrix from the collected image patch, which represents dense sampling of the target and its background. It trains a non-linear prediction model (a regularized least squares model) based on ridge regression to obtain a filter template, that is, a classifier. The similarity between each candidate area and the tracking target is computed with the kernel function, and the candidate area with the highest similarity is selected as the new tracking target, which is then used for detection in the next frame. Ridge regression has a closed-form solution, and the KCF algorithm exploits the properties of the circulant matrix to reduce the computation of the classifier during training and detection through the fast Fourier transform, so training is fast.
The goal of the KCF classifier training process in the linear case is to find the function f(s) = w^T s that minimizes the residual function in formula (13), that is, to find the optimal w:
Here, X_i is the feature of a single training sample. The KCF algorithm describes the human body with multi-channel histogram of oriented gradients features. y_i is the corresponding sample label and follows a Gaussian distribution: the KCF algorithm uses values in [0, 1] as sample labels so that samples obtained at different offsets receive different weights. λ is the regularization parameter, which prevents overfitting. X = [X_0, X_1, ⋯, X_{n-1}] is the sample feature set, Y = [y_0, y_1, ⋯, y_{n-1}] is the sample label set, and m is the number of training samples.
In formula (14), the training sample feature set of the KCF algorithm is a circulant matrix. The properties of the circulant matrix in formula (14) make the closed-form solution of linear least squares easier to obtain, as follows.
In formula (15),
When formula (15) is substituted into formula (16), the following formula is obtained:
After left-multiplying both sides by F, the following formula is obtained:
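The effect of the circulant structure can be checked numerically on a one-dimensional signal. The sketch below is illustrative only: it assumes the data matrix has entries X[i][j] = x[(j − i) mod n], for which the ridge regression solution in the Fourier domain is ŵ = x̂ ⊙ ŷ / (|x̂|² + λ); other shift conventions move the complex conjugate to a different factor. It uses a plain O(n²) DFT so that it needs no third-party library, and all function names are hypothetical.

```python
import cmath

def dft(a):
    n = len(a)
    return [sum(a[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def idft(A):
    n = len(A)
    return [sum(A[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]

def ridge_fourier(x, y, lam):
    """Ridge regression on all cyclic shifts of x, solved element-wise in the Fourier domain."""
    xf, yf = dft(x), dft(y)
    wf = [xk * yk / (abs(xk) ** 2 + lam) for xk, yk in zip(xf, yf)]
    return [v.real for v in idft(wf)]

def ridge_direct(x, y, lam):
    """Reference solution of (X^T X + lam I) w = X^T y by Gaussian elimination."""
    n = len(x)
    X = [[x[(j - i) % n] for j in range(n)] for i in range(n)]   # circulant data matrix
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(n)]
    for c in range(n):                                           # forward elimination
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    w = [0.0] * n
    for i in reversed(range(n)):                                 # back substitution
        w[i] = (b[i] - sum(A[i][k] * w[k] for k in range(i + 1, n))) / A[i][i]
    return w

x = [1.0, 2.0, 0.5, -1.0]
y = [1.0, 0.5, 0.0, 0.5]
w_fourier = ridge_fourier(x, y, 0.1)
w_direct = ridge_direct(x, y, 0.1)
```

The two solutions agree to numerical precision, while the Fourier route replaces the matrix inversion with element-wise division; with a fast Fourier transform in place of the plain DFT this is where the speed-up comes from.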
Formula (18) shows that the solution for the weight vector w is moved into the Fourier domain, where the circulant structure of the collected samples greatly reduces the amount of computation. Classification in a flat 2D field of view is generally non-linear. The regularized least squares (RLS) classifier built during KCF sample training is similar to the SVM classifier: it is based on a kernel function and uses a mapping function to map the features from the low-dimensional space to a high-dimensional space in order to solve the non-linear problem. The RLS classifier is simpler, however, because it minimizes a regularized function directly on the reproducing kernel Hilbert space defined by the kernel. After the kernel function is introduced, the weight vector of the classifier becomes:
In formula (19), ɛ_i is the coefficient corresponding to the i-th training sample, and ɛ = [ɛ_1, ɛ_2, ⋯, ɛ_n] is the new parameter vector to be optimized in the closed-form solution. φ(X) is the mapping function that maps training samples to a high-dimensional feature space, and k is a Gaussian kernel function. The correlation between any two multi-channel samples X_i and X_j in the high-dimensional feature space is:
In formula (20), δ is the bandwidth of the Gaussian function. After formula (20) is substituted into the final objective function, the following formula is obtained.
In the nonlinear prediction model, the parameters of the ridge regression solution based on the kernel function are:
In formula (22), K is the kernel function matrix composed of all training samples and is a circulant matrix, K = C(K^xx), where K^xx is the vector formed by the first row of K. Applying the Fourier transform to equation (22) gives the following formula:
It can be seen that the classifier training process of the KCF target tracking algorithm is the process of finding the optimal ɛ. When the KCF target tracking algorithm detects a certain picture area u to be measured, the probability that the measured area is a tracking target is as shown in equation (24):
When the KCF target tracking algorithm performs fast detection in the newly input image area h, it also performs a cyclic shift on h to construct a sample set H,
Thus, the probability distribution of the input image h being the tracking target over all candidate region positions is obtained. In formula (26), K^xh is the vector formed by the first row of the kernel function matrix K^h.
f (h) is a 1 × n vector and its element values correspond to the probability values of all candidate regions becoming the tracking target, and the candidate region corresponding to the largest element value is the tracking target. When the KCF target tracking algorithm detects and tracks sequence frames, the parameter update model is:
In formula (27), ɛ_t is obtained by training on the target region detected in the current frame, ɛ_{t-1} is obtained by training on the sample image of the previous frame, ɛ_{t+1} is the model parameter used to detect the target in the next frame, and α is the learning rate of the model update. x_{t+1}, x_{t-1}, and x_t are the target models for the next frame, the previous frame, and the current frame, that is, fixed-size image blocks around the target. After a new target region is obtained, it is cyclically shifted to construct a new training sample set. Combined with the training samples from the previous frame, this set is used to update the target model x that the classifier uses to detect and track the target in the next frame.
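The training, detection, and update steps described above can be sketched on a one-dimensional signal. This is a toy illustration under stated assumptions, not the authors' tracker: it uses a single channel, a Gaussian kernel exp(−‖a − b‖²/σ²), a plain O(n²) DFT in place of the FFT, and hypothetical function names. The variable coef_f plays the role of the Fourier-domain coefficient vector that the text writes as ɛ, and lr plays the role of the learning rate α.

```python
import cmath
import math

def dft(a):
    n = len(a)
    return [sum(a[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def idft(A):
    n = len(A)
    return [sum(A[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]

def shift(v, d):
    """Cyclically shift v to the right by d samples."""
    n = len(v)
    return [v[(m - d) % n] for m in range(n)]

def kernel_correlation(x, z, sigma):
    """k[d] = exp(-||x - shift(z, d)||^2 / sigma^2) for every cyclic shift d."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, shift(z, d))) / sigma ** 2)
            for d in range(len(x))]

def train(x, y, sigma, lam):
    """Kernel ridge regression coefficients in the Fourier domain."""
    kf, yf = dft(kernel_correlation(x, x, sigma)), dft(y)
    return [yk / (kk + lam) for yk, kk in zip(yf, kf)]

def detect(coef_f, x, z, sigma):
    """Response for every cyclic shift of the new patch z."""
    kf = dft(kernel_correlation(x, z, sigma))
    return [v.real for v in idft([k * c for k, c in zip(kf, coef_f)])]

def update(old, new, lr):
    """Linear interpolation between the previous model and the newly trained one."""
    return [(1.0 - lr) * o + lr * nw for o, nw in zip(old, new)]

n = 16
template = [math.exp(-((m - 5) ** 2) / 4.0) for m in range(n)]           # target appearance
labels = [math.exp(-(min(m, n - m) ** 2) / 8.0) for m in range(n)]       # Gaussian labels, peak at shift 0
coef_f = train(template, labels, sigma=1.0, lam=1e-2)

patch = shift(template, 3)                                               # target moved 3 samples right
response = detect(coef_f, template, patch, sigma=1.0)
peak = max(range(n), key=lambda j: response[j])
# The response peaks at the shift that re-aligns the patch with the template,
# so the displacement of the target is the negative of that shift.
displacement = (-peak) % n

# Model update for the next frame with learning rate lr.
new_coef_f = train(patch, labels, sigma=1.0, lam=1e-2)
coef_f = update(coef_f, new_coef_f, lr=0.1)
template = update(template, patch, lr=0.1)
```

The same interpolation is applied to both the coefficients and the target model, so the learning rate controls how quickly the tracker forgets older appearances.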
This section describes pedestrian tracking experiments with the KCF-based TLD algorithm, using the experimental platform, evaluation indicators, and data sets, to verify how well the improved algorithm addresses the problems above. In the tracking result figures, the black rectangle is the true position of the tracked target, and the green rectangle is the result produced by the tracking algorithm.
Figure 4 shows the tracking results of the algorithm on the skating2 data set. After initialization in the first frame (Fig. 4a), the algorithm tracks the target well and continues to do so when the target moves, deforms, and is occluded. At frame 168 (Fig. 4b), the target rotates while moving rapidly and the algorithm still tracks it; at frame 228 (Fig. 4c), the algorithm loses the target.

Fig. 4. The tracking effect on the skating2 dataset.
Figures 5–10 are the precision plots of the KCF-based TLD algorithm on six data sets. As the figures show, the algorithm performs best on the boy data set, where the tracking precision at a threshold of 20 pixels is 0.99. The precision is 0.95 on the gym data set, 0.95 on the human2 data set, 0.71 on the walking2 data set, 0.60 on the basketball data set, and 0.61 on the skating2 data set. On the basketball and human2 data sets, the tracking accuracy of the improved TLD algorithm is significantly improved.
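A precision value of this kind is commonly computed as the fraction of frames in which the predicted target centre lies within the threshold distance of the ground-truth centre. The sketch below assumes that definition and an (x, y, width, height) box format; neither is stated in the text, and the function names and sample boxes are hypothetical.

```python
import math

def center(box):
    """Centre of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def precision_at(pred_boxes, gt_boxes, threshold=20.0):
    """Fraction of frames whose centre location error is within the threshold (pixels)."""
    hits = 0
    for p, g in zip(pred_boxes, gt_boxes):
        (px, py), (gx, gy) = center(p), center(g)
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / len(gt_boxes)

# Three frames: errors of 5 px, 0 px, and about 141 px.
pred = [(0, 0, 10, 10), (30, 0, 10, 10), (100, 100, 10, 10)]
truth = [(3, 4, 10, 10), (30, 0, 10, 10), (0, 0, 10, 10)]
score = precision_at(pred, truth)
```

Sweeping the threshold over a range of pixel values and plotting the resulting scores gives a precision plot of the kind shown in the figures.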

Fig. 5. Precision plots of basketball.

Fig. 6. Precision plots of boy.

Fig. 7. Precision plots of gym.

Fig. 8. Precision plots of human2.

Fig. 9. Precision plots of skating2.

Fig. 10. Precision plots of walking2.
Table 1 and Fig. 11 show the average frame rate of the tracking algorithm on the data sets. The KCF tracking algorithm reduces the feature dimension of the tracking target and increases the speed of the tracking algorithm. As the table shows, the real-time performance of the improved TLD algorithm is improved.
Table 1. The average frame rate of the tracking algorithm on the dataset.

Fig. 11. The average number of frames tracked by the tracking algorithm on the dataset.
The improved algorithm copes better with lighting changes and target occlusion. To verify the algorithm in this paper, comparative experiments are performed under different conditions. Table 2 lists the basic conditions of the four sets of experimental data. In Video 1, the target pedestrian moves under normal outdoor lighting, with no occlusion or severe deformation throughout. In Video 2, the environment of the target pedestrian is dim. In Video 3, the environment is not only dimly lit but also has an extremely complicated background, so the pedestrian is hard to distinguish. In Video 4, the environment is poorly lit and the target is severely occluded by the background; at times the target disappears entirely.
Table 2. Experimental video data description.
To verify the algorithm further, it is compared experimentally with three other algorithms. The first is the original TLD algorithm. The second is the K-TLD algorithm, which adds only the Kalman filter model to the original algorithm. The third is the M-TLD algorithm, which adds only the Markov direction model to the original algorithm. The fourth is the algorithm proposed in this paper (N-TLD), which combines these models and adds occlusion judgment. Table 3 shows the average results of 30 tracking tests for the four algorithms. The results show that under normal lighting all four algorithms can track the target, although the K-TLD and M-TLD algorithms occasionally lose it, because both must re-detect the target when it is deformed or occluded. In terms of processing speed, measured by the tracking frame rate, the algorithms with the Kalman filter or Markov model are significantly faster. The N-TLD algorithm proposed in this paper is both fast and better able to estimate and track the target under severe occlusion. The experimental data indicate that the algorithm is robust and can meet tracking requirements under different conditions.
Table 3. Video sequence test results.
After the target detector is called, the coordinates of the detected target areas are first matched to the Trackers by spatial distance. Trackers and targets are generally in one-to-one correspondence. If, after matching by predicted spatial position, some tracked targets are occluded by the background or by other tracked targets, the Tracker data become redundant and the distance between the target and the Tracker is greater than a set threshold. In that case, the detector results are assigned to the matched targets and updated one by one according to the Tracker results. More complicated situations can unbalance the matching: a Tracker may have no matching target, or a single target may be matched by several Trackers at the same time.
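A simple form of spatial distance matching, together with the two unbalanced cases just mentioned, can be sketched as follows. The nearest-neighbour rule, the Euclidean distance between centres, and all names are assumptions made for illustration; the text does not specify the matching rule in this detail.

```python
import math

def match_trackers(trackers, detections, max_dist):
    """Match each tracker to its nearest detection within max_dist.

    trackers and detections map an identifier to an (x, y) centre.
    Returns (assignments, unmatched trackers, detections claimed by several trackers).
    """
    assignments, unmatched = {}, []
    for t_id, (tx, ty) in trackers.items():
        best_id, best_d = None, max_dist
        for d_id, (dx, dy) in detections.items():
            d = math.hypot(tx - dx, ty - dy)
            if d <= best_d:
                best_id, best_d = d_id, d
        if best_id is None:
            unmatched.append(t_id)          # case 1: the tracker has no matching target
        else:
            assignments[t_id] = best_id
    claimed = {}
    for t_id, d_id in assignments.items():
        claimed.setdefault(d_id, []).append(t_id)
    # case 2: one target matched by several trackers at the same time
    contested = {d: ts for d, ts in claimed.items() if len(ts) > 1}
    return assignments, unmatched, contested

trackers = {"A": (10.0, 10.0), "B": (12.0, 11.0), "C": (200.0, 200.0)}
detections = {"d1": (11.0, 10.0), "d2": (90.0, 90.0)}
assignments, unmatched, contested = match_trackers(trackers, detections, max_dist=15.0)
```

In this example trackers A and B both claim detection d1 and tracker C is left unmatched, which are the two cases that need further resolution.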
During matching, there are often Trackers whose ownership cannot be determined. Two rules are used to resolve this. First, if there are hidden or occluded targets near the target, ownership can be determined from the position of the target. Second, if the target has newly entered the scene, the other targets that are being tracked normally are not disturbed. If neither rule resolves the case, it is necessary to check whether the target is a hidden target that has reappeared or one that has separated after occlusion. Only after these checks can the Tracker ownership of the target be confirmed. The TLD detector saves the historical feature vectors of each target so that two targets whose paths cross can be told apart. This generally happens when the Tracker moves or when a tracked target changes from hidden to visible. In this case, the size of the tracked target, its feature description vector, and its spatial position are used jointly to improve matching accuracy.
Conclusion
To improve feature recognition in sports competition, this study combines image recognition and feature extraction algorithms with the TLD algorithm. The TLD algorithm is applied to long-term pedestrian tracking with PTZ cameras and is improved to address its shortcomings: HOG-SVM detection improves the detection part, and the KCF tracking algorithm improves the tracking part. The improved algorithm is analyzed experimentally on standard data sets and verified by real-world experiments on a PTZ camera. The results show that the KCF-based TLD algorithm copes well with environmental interference and target occlusion, and that its overall tracking performance and tracking accuracy are good.
