Abstract
Epidemic prevention and control during the novel coronavirus pneumonia outbreak has raised the requirements for behavior recognition in complex environments. Traditional methods are not accurate enough for sports training, so a method that improves local action recognition is needed to assist training. If each track is treated as an independent individual during behavior recognition, the information carried by its neighbors is ignored. We therefore use the KNN algorithm to obtain the nearest neighbor trajectories. To capture the rich neighborhood information around a track, this paper describes the relationship between the center track and its neighborhood tracks from four different angles: absolute motion, relative motion, distance relationship and direction relationship. Then, from the four perspectives of variance, discrete coefficient, skewness and kurtosis, this paper proposes a large interval nearest neighbor coding method. The method lets the four feature values complement each other and improves the ability to describe complex and changeable behaviors. The experimental results show that the proposed coding method, used with the different learned transformation matrices, is effective for behavior recognition.
Introduction
Human behavior recognition is one of the most important research directions in machine learning, and it has developed rapidly in both academic research and practical application. At present, human behavior recognition and video classification technology is widely applied in many intelligent industries and has broad market and application prospects [1, 2]. For action recognition in sports training in particular, the recognition accuracy determines how training methods are adjusted and therefore affects the training results of athletes. However, traditional methods are not accurate enough for sports training, so a method that improves local action recognition is needed to assist training.
Throughout the development of behavior recognition, the number of local features extracted from each video may differ, so coding is needed to give every video representation the same feature dimension. This need has become more and more obvious as behavior recognition methods based on local features, in particular local dense tracks, have achieved better and better results. Because the number of dense tracks acquired in each video is usually different, the tracks cannot be used directly for SVM classifier learning and classification. To solve this problem, local features need to be further encoded to obtain a video feature vector representation for behavior recognition [5–10]. Coding methods usually first obtain basic code words by clustering, and then generate the coding results by methods such as hard allocation, soft allocation and probability models. However, these methods often ignore two problems in the coding process. On the one hand, the clusters obtained from the same category may have fuzzy boundaries, which makes the projection of unknown samples ambiguous. On the other hand, because the motion or the extracted features of different categories can be similar, especially for two behaviors that are easily confused, the clusters generated by each category can be very similar, which makes the coding features of the two behaviors difficult to distinguish [11].
To solve these two problems at the same time, this paper proposes a coding method based on the large interval nearest neighbor algorithm. The algorithm processes different clusters of the same category and clusters of different categories simultaneously, and thereby learns a new distance measure under which the clusters are as far apart as possible in a new transformation space. Projecting with the new distance measure improves the accuracy of projection and reduces confusion [12]. Finally, we learn a different transformation matrix for each of the four angles of the spatial-temporal neighbor distribution descriptors between tracks, encode each angle with the large interval nearest neighbor coding method, and combine the coding results for behavior recognition.
Nearest neighbor algorithm for track spatial-temporal characteristics
First, we use the large interval nearest neighbor algorithm to address two problems at the same time: the fuzzy boundaries between different clusters of the same category, and the confusion between similar clusters of different categories. The algorithm does so by learning a new distance metric and transforming the clusters into a new space. We then use the learned distance metric for projection. The projection process can be regarded as finding nearest neighbors in the set of cluster centers, so a better metric improves the accuracy of finding neighbors, which in turn improves the projection accuracy and yields better coding results.
Large interval nearest neighbor algorithm
The large interval nearest neighbor algorithm (usually called large margin nearest neighbor, LMNN, in the literature) is a machine learning algorithm for k-nearest neighbor classification. It can be regarded as an optimization of KNN classification and can improve its accuracy. The algorithm transforms the samples from the original space into a new transformation space and, by learning a distance measure, pursues the following objective: the k nearest neighbors of an input x_i all belong to the same category as x_i, while samples of other categories are kept at a certain distance from x_i as far as possible. The main task of the algorithm is to find a transformation matrix M that satisfies these conditions, so that the distance between two samples is measured by the new distance d_M instead of the original Euclidean distance. The Euclidean distance is a special case of d_M.
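As an illustration of the last point, the following minimal sketch evaluates a distance of the form d_M(x, y) = (x - y)^T M (x - y) and shows that M = I gives back the squared Euclidean distance. The function name `metric_distance`, the linear map `L` and the toy vectors are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def metric_distance(x, y, M):
    """Squared distance d_M(x, y) = (x - y)^T M (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ M @ diff)

x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])

# With M = I the measure reduces to the squared Euclidean distance.
d_euclid = metric_distance(x, y, np.eye(2))

# Hypothetical linear map L; M = L^T L is positive semidefinite by construction.
L = np.array([[2.0, 0.0],
              [0.0, 0.5]])
M = L.T @ L
d_learned = metric_distance(x, y, M)

print(d_euclid, d_learned)
```

Writing M as L^T L is a common way to guarantee that d_M is never negative; it also means that d_M is the squared Euclidean distance after mapping the samples with L.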
To apply the large interval nearest neighbor algorithm, we first need the class label of each sample, its target neighbors and its intrusion samples. For the input sample x_i, the class label is denoted y_i. Among the k nearest neighbors of x_i, those whose label equals y_i are regarded as target neighbors, and those whose label differs from y_i are regarded as intrusion samples. In the transformation space, the target neighbors should be as close to the input sample as possible, and the intrusion samples should be as far from it as possible. In the end, the neighbors of each sample are samples of its own class, while samples of different classes are far apart. The learning effect of the algorithm is shown in Fig. 1.

Fig. 1. Learning effect of the transformation matrix.
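The definition of target neighbors and intrusion samples given above can be sketched as follows. This is a minimal illustration that uses the Euclidean distance in the original space; the function name `split_neighbors` and the toy data are assumptions for the example, not part of the paper.

```python
import numpy as np

def split_neighbors(X, y, i, k):
    """Split the k nearest neighbors of X[i] into target neighbors
    (same label as y[i]) and intrusion samples (different label)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                      # exclude the sample itself
    nearest = np.argsort(dists)[:k]        # indices ordered by distance
    targets = [int(j) for j in nearest if y[j] == y[i]]
    intruders = [int(j) for j in nearest if y[j] != y[i]]
    return targets, intruders

# Toy data: sample 3 carries a different label but lies close to sample 0.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [0.15, 0.1], [5.0, 5.0]])
y = np.array([0, 0, 0, 1, 1])

targets, intruders = split_neighbors(X, y, i=0, k=3)
print(targets, intruders)
```

Learning the transformation matrix then amounts to pulling the samples in `targets` toward the input sample and pushing those in `intruders` away from it.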
The principle of the large interval nearest neighbor algorithm is described in Ref. [13]. The key question here is how to use the algorithm for coding, so that clusters of the same category with fuzzy boundaries and similar clusters of different categories are separated at the same time, which reduces confusion during sample projection and improves the accuracy of coding.
The large interval nearest neighbor coding process can be divided into three steps: assigning categories and acquiring labels, learning the new distance measure to obtain the transformation matrix M, and projecting in the new transformation space. The specific process is as follows:
Step 1: We first describe the tracks of each video with the spatial-temporal neighbor distribution descriptors between tracks, and then cluster the features of each behavior category separately. The number of clusters is set to S, and each generated cluster is regarded as a new category, so the samples belonging to one cluster share the same category label. The clusters of all behavior categories are labeled in turn. If the number of behavior categories is Z, the number of categories assigned to the generated clusters is Z×S.
The details are as follows. Assume that the feature set of a category is {x_1, x_2, ..., x_m, ..., x_p}, x_m ∈ R^D, and that the K-means clustering algorithm is used to cluster the feature set. The number of labels of this category is then S, and the clusters are expressed as {c_1, c_2, ..., c_s, ..., c_S}. According to the clustering results, an indicator variable r_ms ∈ {0, 1} is obtained for each feature vector x_m in the set: if x_m belongs to cluster s, then r_ms = 1 and r_mj = 0 for j ≠ s. We minimize the following objective function:
J = \sum_{m=1}^{p} \sum_{s=1}^{S} r_{ms} \lVert x_m - c_s \rVert^2
We can obtain {r_ms} and {c_s} by iterative optimization. If each cluster is regarded as a category, we obtain the set of all cluster centers D = {c_1, c_2, ..., c_s, ..., c_S}.
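Step 1 can be sketched as follows: each behavior category is clustered separately into S clusters, and the clusters are numbered in turn so that Z categories yield Z×S labels. The plain Lloyd-style K-means below, the function names and the toy data are assumptions made for illustration; any K-means implementation could take their place.

```python
import numpy as np

def kmeans(X, S, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=S, replace=False)].astype(float)

    def assign_to_centers():
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        assign = assign_to_centers()           # update r_ms with c_s fixed
        for s in range(S):
            members = X[assign == s]
            if len(members) > 0:
                centers[s] = members.mean(axis=0)   # update c_s with r_ms fixed
    return centers, assign_to_centers()

def cluster_per_category(features_by_category, S):
    """Cluster every category separately and number the clusters in turn."""
    centers, labels = [], []
    for z, X in enumerate(features_by_category):
        c, assign = kmeans(X, S, seed=z)
        centers.append(c)
        labels.append(z * S + assign)          # global label in [0, Z*S)
    return np.vstack(centers), labels

# Toy example with Z = 2 behavior categories and S = 2 clusters each.
features_by_category = [
    np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float),
    np.array([[50, 0], [50, 1], [60, 0], [60, 1]], dtype=float),
]
all_centers, labels = cluster_per_category(features_by_category, S=2)
print(all_centers.shape, [l.tolist() for l in labels])
```

Offsetting the cluster index by z × S keeps the labels of different behavior categories disjoint, which is what allows the metric learning step to treat every cluster as its own category.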
Step 2: Learn the new distance metric to obtain the transformation matrix M. Step 1 gives the label of every sample and, for each category label, all samples that carry it. We then randomly select O samples from each category. The selected samples form a matrix F of size (O×S)×d, where d is the feature dimension, and the corresponding category labels form a vector L_F of dimension O×S. F and L_F are used as the inputs of the large interval nearest neighbor algorithm to learn the transformation matrix M. The specific equation is as follows:
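The equation is not reproduced in this text. For reference, the objective of the large margin nearest neighbor algorithm as it is commonly stated in the literature is sketched below. The weight c, the slack variables ξ_ijl and the indicators η_ij and y_il follow the usual notation and are assumptions here; the exact form used by the authors may differ.

```latex
\min_{M \succeq 0} \;
\sum_{i,j} \eta_{ij}\, d_M(x_i, x_j)
\;+\; c \sum_{i,j,l} \eta_{ij}\,(1 - y_{il})\,\xi_{ijl}
\quad \text{s.t.} \quad
d_M(x_i, x_l) - d_M(x_i, x_j) \ge 1 - \xi_{ijl}, \qquad \xi_{ijl} \ge 0,
\qquad
d_M(x_i, x_j) = (x_i - x_j)^{\top} M \,(x_i - x_j)
```

Here η_ij = 1 if x_j is a target neighbor of x_i and 0 otherwise, and y_il = 1 if x_i and x_l share a label and 0 otherwise. The first term pulls target neighbors together, and the second term penalizes every intrusion sample x_l that comes within one unit of margin of a target neighbor.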
Step 3: In the original space, the distance between samples is the Euclidean distance. After the samples are transformed into the new space by the learned transformation matrix M, the distance measure between samples changes. For all samples of each behavior video, we use the learned distance metric to calculate the distance between each sample and the center of each category, and the equation is as follows:
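The equation is not reproduced in this text. Assuming the learned measure takes the usual quadratic form determined by M, the distance between a sample x and a category center c_s, and the label under which x is projected, can be written as in the sketch below; the symbol s^{*} is introduced here only for illustration.

```latex
d_M(x, c_s) = (x - c_s)^{\top} M \,(x - c_s),
\qquad
s^{*}(x) = \arg\min_{s} \; d_M(x, c_s)
```

The minimum is taken over all category centers obtained in Step 1, so each sample is projected onto exactly one label.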
For each video, the number of samples projected under each label is counted, and the coding result of the video is obtained after normalization. Since the number of labels is the same for every video, the encoded video vectors have the same dimension.
Fig. 2 illustrates the effect of the large interval nearest neighbor coding in this paper. The first row shows how this paper assigns categories and labels. The second row shows the case in which clusters of the same category have fuzzy boundaries: although the clusters are separated, samples at the boundary of one cluster clearly have samples of other clusters among their neighbors. In other words, the two clusters are not separated far enough, which is not conducive to the projection of unknown samples. The third row shows the case of high similarity between clusters of different categories. In this case, if the similarity between categories is not considered, clusters of different categories overlap, which easily causes confusion. The coding method proposed in this paper handles both cases.

Fig. 2. Large interval nearest neighbor coding.
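The projection and counting in Step 3 can be sketched as follows: every sample of a video is assigned to the nearest category center under the learned measure, the assignments are counted per label, and the histogram is normalized. The paper does not state which normalization is used for the codes; dividing by the number of samples is an assumption of this sketch, as are the function name `encode_video` and the toy data.

```python
import numpy as np

def encode_video(samples, centers, M):
    """Histogram of nearest-center assignments under d_M, normalized to sum to 1."""
    diffs = samples[:, None, :] - centers[None, :, :]        # shape (n, C, d)
    # d[n, c] = diffs[n, c]^T M diffs[n, c]
    d = np.einsum('ncd,de,nce->nc', diffs, M, diffs)
    nearest = d.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(centers)).astype(float)
    return hist / hist.sum()                                 # assumed normalization

centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
samples = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 1.0], [1.0, 9.0]])

# M = I is used here only as a placeholder for a learned matrix.
code = encode_video(samples, centers, np.eye(2))
print(code)
```

Because the length of the returned vector equals the number of centers, videos with different numbers of tracks are mapped to vectors of the same dimension.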
As shown in Fig. 3, the behavior recognition framework in this paper is divided into three parts: velocity acceleration co-occurrence descriptor construction, spatial-temporal neighbor distribution descriptor construction between trajectories, and large interval nearest neighbor coding and classification.

Fig. 3. The framework of behavior recognition.
Step 1: Construct the velocity and acceleration co-occurrence descriptors. First, we extract the initial dense trajectories, and then calculate the discrete coefficient (coefficient of variation) of each trajectory from the appearance and motion angles to express the importance of the trajectory. Finally, we keep the first M% of the trajectories for feature description. In this paper, we use the co-occurrence statistics of velocity and acceleration to capture the relationship between them and mine the motion trend information, which yields a discriminative velocity and acceleration co-occurrence descriptor that describes the motion state of a behavior more accurately.
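The screening step can be sketched as follows. The paper does not give the exact quantities on which the discrete coefficient is computed or the direction of the ranking; this sketch assumes that the coefficient is the standard deviation divided by the mean of a per-frame motion magnitude and that trajectories with larger coefficients are kept. The function names and the toy values are hypothetical.

```python
import numpy as np

def discrete_coefficient(values, eps=1e-12):
    """Coefficient of variation: standard deviation divided by mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / (abs(values.mean()) + eps)

def screen_trajectories(scores, percent):
    """Indices of the top `percent` percent of trajectories by score."""
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(len(scores) * percent / 100.0)))
    order = np.argsort(-scores)             # descending score
    return np.sort(order[:n_keep])

# Per-frame motion magnitudes of four hypothetical trajectories.
speeds = [
    [1, 1, 1, 1],   # constant motion, coefficient 0
    [1, 3, 1, 3],
    [0, 4, 0, 4],
    [2, 2, 2, 3],
]
scores = [discrete_coefficient(s) for s in speeds]
kept = screen_trajectories(scores, percent=50)
print(kept.tolist())
```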
Step 2: Construct the spatial-temporal neighbor distribution descriptor between tracks. First, the KNN algorithm is used to obtain the neighbor trajectories of each center trajectory. Then the relationship between a trajectory and its neighbors is described from four aspects: absolute motion, relative motion, distance relationship and direction relationship. Finally, nine measures are used to describe the neighbor information in three aspects of statistics, so as to construct the spatial-temporal neighbor distribution descriptor of this paper.
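A minimal sketch of this step for one of the four aspects, the distance relationship, is given below. It finds the K nearest neighbor trajectories of a center trajectory and summarizes the center-to-neighbor distances with the four statistics named in the abstract (variance, discrete coefficient, skewness and kurtosis). Representing each trajectory by a single 2D position, the function names and the toy data are assumptions; the full set of nine measures is not specified in this text and is not reproduced here.

```python
import numpy as np

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i], nearest first."""
    d = np.linalg.norm(points - points[i], axis=1)
    d[i] = np.inf                               # exclude the center itself
    return np.argsort(d)[:k]

def four_statistics(values, eps=1e-12):
    """Variance, discrete coefficient, skewness and (non-excess) kurtosis."""
    v = np.asarray(values, dtype=float)
    mean = v.mean()
    centered = v - mean
    m2 = (centered ** 2).mean()
    m3 = (centered ** 3).mean()
    m4 = (centered ** 4).mean()
    return np.array([
        m2,                                     # variance
        np.sqrt(m2) / (abs(mean) + eps),        # discrete coefficient
        m3 / (m2 ** 1.5 + eps),                 # skewness
        m4 / (m2 ** 2 + eps),                   # kurtosis
    ])

# One representative position per trajectory (hypothetical values).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [10.0, 10.0]])

neighbors = knn_indices(positions, i=0, k=3)
distances = np.linalg.norm(positions[neighbors] - positions[0], axis=1)
stats = four_statistics(distances)
print(neighbors.tolist(), stats)
```

The same statistics could be applied to the absolute motion, relative motion and direction values of the neighbors to obtain the remaining parts of the descriptor.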
Step 3: Large interval nearest neighbor coding and classification. Using the spatial-temporal neighbor distribution descriptors proposed above, we learn a transformation matrix for each of the four angles with the large interval nearest neighbor algorithm, and then combine the coding results of the four angles as the final representation of the video. This representation is input to a support vector machine (SVM), and the behavior recognition results of this paper are obtained through training and testing in the SVM.
In this paper, four commonly used standard data sets are used: Weizmann, KTH, UCF-Sports and the YouTube video database. The first two have simple behaviors and backgrounds, and the last two have complex behaviors and backgrounds. Experiments on these four databases verify the effectiveness of the spatial-temporal neighbor distribution descriptors, the proposed coding method and the proposed framework.
The public database
(1) Weizmann video library
The Weizmann video database contains 10 different behavior categories. In this database, 9 individuals who differ in clothing and gender each perform 10 kinds of motion behaviors. The shooting scenes of this database are static, without background interference or camera jitter. The behavior categories are: bending down, walking, running, galloping sideways, jumping in place on two legs, jumping jack, skipping, jumping forward on two legs, waving one hand and waving two hands. This is a typical database with simple background and simple behavior, and example videos are shown in Fig. 4.

Fig. 4. Example of behavior videos in the Weizmann database.
(2) KTH database
The KTH database contains 6 kinds of relatively simple behavior videos, and each kind is performed by 25 people in 4 scenes. The shooting environments of the database are: an indoor environment; a basically static outdoor environment; an outdoor environment with camera zoom (scale) changes; and an outdoor environment with different clothing.
In addition, the motion duration and shooting angle differ between videos, and the camera has slight jitter. The KTH database is commonly used for experimental verification of behavior recognition results. The behavior categories in this database are running, jogging, walking, hand waving, hand clapping and boxing, as shown in Fig. 5.

Fig. 5. Example of behavior videos in the KTH database.
(3) UCF sports database
The UCF-Sports database includes 10 kinds of sports video clips with a size of 720×480. The videos are mainly common sports clips, which have relatively complex backgrounds as well as complex and changeable behavior. The categories are: running, walking, swinging at the high bar, kicking a ball, diving, weight lifting, swinging on the pommel horse and on the floor, horse riding, golf swing and skateboarding. The database has 150 video clips in total and is a typical database with complex behavior and complex background. Example videos are shown in Fig. 6.
(4) YouTube database
The behavior videos in the YouTube database are shot in 25 different scenes and cover 11 kinds of behavior: biking, soccer juggling, horseback riding, trampoline jumping, basketball shooting, tennis swing, swing, golf swinging, volleyball spiking, walking with a dog and diving. The video size of the database is 320 × 240. Its other typical features are that the scenes are varied, the backgrounds are cluttered, and the shooting angle and shooting quality differ between videos. In addition, some videos are shot with a static camera and others with camera motion. Example videos from the YouTube database are shown in Fig. 7.

Fig. 6. Example of behavior videos in the UCF-Sports database.

Fig. 7. Example of behavior videos in the YouTube database.
The experimental setup of this paper is as follows. For the track and the track cube, the side length of the rectangular box centered on the track is N = 32, the track tracking length is L = 15, the number of scales is Q = 8, the sampling interval is w = 5, and the number of track sub-blocks is c = 12. Since the number of tracks per video differs by orders of magnitude between databases, the percentage M in track screening is chosen according to the actual situation. Let T be the actual number of tracks in a behavior video. For the Weizmann library, if T < 2000, set M = 90; otherwise, set M = 50. For the KTH library, if T < 5000, set M = 60; otherwise, set M = 20. For the UCF-Sports library, if T < 5000, set M = 60; otherwise, set M = 10. For the YouTube library, if T < 5000, set M = 60; if 5000 < T < 20000, set M = 10; otherwise, set M = 5.
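The screening rule above can be written as a small lookup function. The function name `screening_percentage` and the database keys are chosen for this sketch; the thresholds are those given in the text. For the YouTube library the text does not say which branch applies at exactly T = 5000, so the sketch assumes that it falls into the M = 10 branch.

```python
def screening_percentage(database, T):
    """Return the screening percentage M for a video with T tracks."""
    if database == "Weizmann":
        return 90 if T < 2000 else 50
    if database == "KTH":
        return 60 if T < 5000 else 20
    if database == "UCF-Sports":
        return 60 if T < 5000 else 10
    if database == "YouTube":
        if T < 5000:
            return 60
        # Assumption: T == 5000 is treated like 5000 < T < 20000.
        return 10 if T < 20000 else 5
    raise ValueError(f"unknown database: {database}")

for name, T in [("Weizmann", 1500), ("KTH", 8000), ("YouTube", 12000), ("YouTube", 30000)]:
    print(name, T, screening_percentage(name, T))
```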
In the construction of the velocity and acceleration co-occurrence descriptor, the number of quantization directions of the velocity v is set to 4 and 8 in turn, and the number of quantization directions of the acceleration is also set to 4 and 8 in turn. The four resulting combinations are used for behavior recognition on three types of videos in the UCF-Sports database. The results are shown in Fig. 8. Based on these results, the number of quantization directions of the velocity v is finally set to m = 8, and that of the acceleration is set to n = 8.

Fig. 8. Influence of velocity and acceleration quantization direction on recognition results.
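To make the role of the quantization directions m and n concrete, the following sketch builds a joint histogram of quantized velocity and acceleration directions for one trajectory. The paper does not spell out the construction of the descriptor, so the finite-difference estimates of velocity and acceleration, the pairing of each acceleration with the later of the two velocities it is computed from, and the function names are all assumptions of this example. The L2 normalization follows the statement in the experimental setup that all descriptors are L2 normalized.

```python
import numpy as np

def quantize_direction(vectors, n_bins):
    """Map 2D vectors to one of n_bins equal angular sectors."""
    angles = np.arctan2(vectors[:, 1], vectors[:, 0]) % (2 * np.pi)
    return np.floor(angles / (2 * np.pi / n_bins)).astype(int) % n_bins

def cooccurrence_descriptor(points, m=8, n=8):
    """Joint histogram of velocity and acceleration directions (m x n bins)."""
    points = np.asarray(points, dtype=float)
    velocity = np.diff(points, axis=0)            # shape (L-1, 2)
    acceleration = np.diff(velocity, axis=0)      # shape (L-2, 2)
    v_bins = quantize_direction(velocity[1:], m)  # align with acceleration
    a_bins = quantize_direction(acceleration, n)
    hist = np.zeros((m, n))
    np.add.at(hist, (v_bins, a_bins), 1.0)        # count co-occurrences
    hist = hist.ravel()
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist      # L2 normalization

# Hypothetical trajectory of five tracked points.
track = [[0, 0], [2, 1], [3, 3], [3, 6], [1, 8]]
desc = cooccurrence_descriptor(track, m=8, n=8)
print(desc.shape, np.count_nonzero(desc))
```

With m = n = 8 the descriptor has 64 bins, while m = n = 4 gives 16, which is the trade-off between resolution and sparsity that the four tested combinations explore.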
In this paper, the number of nearest neighbor trajectories K in the KNN algorithm is selected empirically. Because the number of trajectories in the Weizmann library is small, K = 30 is used there; the number of trajectories in the other databases is large, so K = 100 is used for all of them. When estimating the average speed of a trajectory, the side lengths of the three regions A, B and C are set to 16, 28 and 32 respectively, and the weights are w1 = 1/2, w2 = 1/3 and w3 = 1/6. All feature descriptors constructed in this paper are L2 normalized. For BOW encoding, the number of codewords is set to 4000. For large interval nearest neighbor coding, O = 3 is used. To explore the influence of S on the recognition results, we set S to 10, 30 and 50 in turn, code the absolute motion part of the spatial-temporal neighbor distribution descriptors, and apply the coded results to behavior recognition. The recognition rates obtained are 93.33%, 93.33% and 94% respectively. We therefore set S = 50.
In this paper, a linear SVM classifier is used for training and testing in behavior recognition. The data sets are partitioned differently in the different databases. On the Weizmann, UCF-Sports and YouTube databases, this paper uses leave-one-out partitioning, with details that differ between databases: Weizmann and UCF-Sports use leave-one-person-out cross-validation, and YouTube uses leave-one-group-out cross-validation. On the KTH data set, the standard 16 : 9 partition is adopted, in which 16 of the 25 people are used for training and the remaining 9 are used for testing.
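The leave-one-person-out and leave-one-group-out schemes share the same mechanics: all videos of one person or group are held out for testing, the rest are used for training, and this is repeated for every person or group. A minimal sketch is given below; the function name `leave_one_group_out` and the toy group labels are assumptions for illustration.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for every group."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx

# One entry per video: the person (or group) the video belongs to.
groups = ["p1", "p1", "p2", "p3", "p3", "p3"]
splits = list(leave_one_group_out(groups))

for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

The reported accuracy under such a scheme is typically the average over all splits, so every video is used for testing exactly once.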
In order to illustrate the effectiveness of the velocity acceleration co-occurrence descriptors and the spatial-temporal nearest neighbor distribution descriptors proposed in this paper, the feature descriptors proposed in this section are used for behavior recognition and compared with the behavior recognition results of other feature descriptors.
To obtain the recognition results of the velocity acceleration co-occurrence descriptor and the spatial-temporal neighbor distribution descriptor on each data set, we proceed as follows. First, we construct the velocity acceleration co-occurrence descriptor and the spatial-temporal neighbor distribution descriptor for each trajectory in the behavior video. Then, for each descriptor, we use k-means clustering to obtain a code book and perform BOW coding, which gives the encoded feature vector of each behavior video. Finally, the feature vectors are input into the SVM for training and testing, which gives the recognition accuracy of each descriptor on each library.
Table 1 lists the recognition rates of the proposed features and the comparison algorithms on KTH, UCF-Sports and YouTube. In this section, to highlight the effectiveness of the descriptors themselves, the recognition rates of the velocity acceleration co-occurrence descriptors and the spatial-temporal neighbor distribution descriptors listed in Table 1 are obtained with the traditional BOW coding method.
Table 1. The behavior recognition rate.
The following observations can be made from Table 1. Firstly, the velocity and acceleration co-occurrence descriptors obtained in this paper by co-occurrence statistics give better recognition results than descriptors constructed from velocity or acceleration alone. Among the compared descriptors, HOA denotes the optical flow acceleration histogram, and HSGA denotes the acceleration spatial gradient histogram that its authors compute on the basis of HOA; both are constructed from acceleration information. In the literature, the authors report the results of these features on the original dense trajectories and on the edge trajectories respectively. The HOF feature denotes the histogram of optical flow direction, which is constructed from velocity information, while HMA denotes the motion acceleration descriptor, which is constructed from acceleration information. Since no comparison experiments were conducted on the Weizmann library, we only compare on the other three libraries. On the KTH library, the velocity and acceleration co-occurrence descriptor of this paper achieves a recognition accuracy of 95.485%. On the UCF-Sports library it achieves 90%, and on the YouTube library it achieves 84.97%. We attribute this to the fact that the proposed descriptor computes co-occurrence statistics over velocity and acceleration together. Compared with treating velocity and acceleration as independent features, using the relationship between them captures the motion trend and describes the movement of the object more accurately, which strengthens the descriptive ability of the features and improves the recognition results.
Secondly, the spatial-temporal neighbor distribution descriptors proposed in this paper achieve higher recognition rates than the velocity acceleration co-occurrence descriptors proposed in this paper. The reason is that the spatial-temporal neighbor distribution descriptors between trajectories consider not only the motion and position information of the trajectory itself, but also the neighbor information around it. They therefore capture more information that is useful for behavior recognition and describe the trajectory more comprehensively, so the recognition results are better. Finally, compared with the feature descriptors of the comparison methods, both the velocity acceleration co-occurrence descriptors and the spatial-temporal neighbor distribution descriptors proposed in this paper give better results, which supports the effectiveness of the proposed features for behavior recognition.
The recognition results of the spatial-temporal neighbor distribution descriptors proposed in this paper under the traditional BOW coding and under the large interval nearest neighbor coding are shown in Table 2. On the Weizmann library, the recognition accuracy is 100%. On the KTH, UCF-Sports and YouTube databases, the proposed coding method is 0.67%, 2% and 1.82% higher than the BOW coding method, respectively. The large interval nearest neighbor coding method proposed in this paper therefore performs better. This result indicates that the proposed coding method reduces confusion in the coding process and improves the coding effect, which further improves the behavior recognition results.
Table 2. The comparison of the proposed coding method and BOW coding in four databases.
In this paper, we first introduce the large interval nearest neighbor coding method, which improves the projection accuracy by learning a new distance measure. We then describe the proposed behavior recognition framework. Using the information from the four angles of the spatial-temporal neighbor distribution descriptors between tracks, we learn a different transformation matrix for each angle with the large interval nearest neighbor algorithm, combine the vectors encoded by the large interval nearest neighbor coding method as the video representation, and input this representation into an SVM classifier for recognition and classification. Finally, the results of the proposed method on four databases and the comparative experiments with other advanced methods show the effectiveness of the proposed method.
