Abstract
Human activity recognition is a key technology in intelligent video surveillance and an important research direction in computer vision. However, the complexity of human interaction features and the variation of motion characteristics across different time periods remain persistent challenges. In this paper, a human interaction recognition algorithm based on a parallel multi-feature fusion network is proposed. First, because different time periods of an action provide different amounts of information, an improved time-phased video downsampling method based on a Gaussian model is proposed. Second, the Inception module uses convolution kernels of different scales for feature extraction, which improves network performance while reducing the number of network parameters, and the ResNet module mitigates the degradation problem caused by increasing network depth and achieves higher classification accuracy. We therefore combine the advantages of the Inception network and ResNet to extract feature information and then fuse the extracted features. After fusion, training continues, which realizes the parallel connection of the multi-feature neural network. Experiments are carried out on the UT dataset. Compared with traditional activity recognition algorithms, the proposed method recognizes the six kinds of interactive actions more accurately, reaching an accuracy of 88.9%.
Keywords
Introduction
As a research hotspot in computer vision, human activity recognition has broad application prospects in fields such as human-computer interaction [1, 2, 3, 4, 5, 6, 7], virtual reality [8] and motion analysis [9, 10, 11, 12, 13, 14]. In recent years, researchers have worked to solve the difficult problems in this area, which has promoted the rapid development of the technology. Among these problems, extracting the feature information in the data effectively and quickly remains a major difficulty.
Related work
In feature extraction, global features contain more information about the human body, but their descriptors are sensitive to changes such as noise and occlusion. Local features such as space-time interest points, which do not depend on segmenting and tracking the moving human body, are less sensitive to noise and occlusion, but stable space-time interest points are difficult to extract. Algorithms that combine global and local features to exploit the advantages of both are therefore more common. Meng et al. [15] detected the positions of joint points and used different pose estimates to train different kinds of action models. They also extracted the semantic spatial relationship between people and combined it with appearance features to realize human activity recognition. Wang et al. [16, 17] used the optical flow field to obtain trajectory features in the video sequence. They then extracted the optical flow direction histogram, the motion direction histogram and the motion boundary histogram to obtain a motion descriptor, which improves activity recognition accuracy. In contrast to methods based on RGB image data, some algorithms take as input the 3D joint points obtained from depth images. Vemulapalli et al. [18] used rotations and translations in 3D space to model the geometric relationships between the various parts of the body.
With the development of deep neural networks such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, human activity recognition has made further progress, and these algorithms have surpassed traditional ones in performance. Simonyan et al. [19] proposed a two-stream CNN structure, in which the spatial stream processes a single frame image and the temporal stream processes dense optical flow computed over multiple consecutive frames. The two streams are trained separately on the spatial and temporal dimensions, and their results are finally fused for classification. To handle the long time span of video sequences, Wang et al. [20] proposed the temporal segment network, which builds on the two-stream method and analyses the motion of the whole video through uniform sparse sampling and video-level supervision. Ng et al. [21] combined a CNN with an LSTM network, whose memory units express the frame sequence effectively and improve classification performance. Tran et al. [22] proposed the C3D network to extract spatio-temporal features of video and achieved good experimental results; the 3D convolution it uses captures temporal information well and has inspired many researchers. Chiang et al. [23] proposed a low-complexity 2D hand gesture identification technique based on 3D depth information. Their design overcomes the difficulty of separating the complete palm region from a complex background, and they use time-view motion tracking to identify various human behaviors. Chen et al. [24] proposed the Multi-Fiber network to make convergence as fast as possible. It splits a complex network into lightweight networks (fibers) and introduces a multiplexer module to enable information flow between the fibers. Wang et al. [25], inspired by the classic non-local means method in computer vision, proposed non-local operations as a family of general building blocks for capturing long-range dependencies. This method studies the relationship between two pixels separated by a certain distance in an image, as well as the relationship between video frames.
We focus on human interaction recognition. Because activity characteristics differ across the time periods of a video, an improved downsampling method based on a Gaussian model is proposed. This method extracts video keyframes and removes a large amount of redundant information. Because of the high dimensionality of the feature information, a parallel multi-feature fusion network is proposed for feature extraction.
Materials and methods
In this paper, different sampling frequencies are used to account for the differences in motion characteristics across the time periods of a video. We propose an improved Gaussian-model downsampling method that incorporates time-phase features to obtain video keyframes and remove a large amount of redundant information. To address the complexity of human interaction features, we propose a human interaction recognition method based on a parallel multi-feature fusion network. Through transfer learning, the Inception network and the residual neural network (ResNet) are used to extract features. The extracted preliminary features are combined in parallel, training continues on the combined features, and the interactive action classification results are obtained. The overall experimental block diagram is shown in Fig. 1.
Human interaction recognition of parallel multi-feature fusion network.
Different video images have different sizes, different resolutions, uneven illumination, etc., which will affect the accuracy of human activity recognition. In order to reduce the adverse effects, it is necessary to perform pre-processing operations on the video image data.
In this paper, we use the UT dataset for experiments. To meet the current demand for large amounts of data and to improve the robustness of the human motion recognition model, data enhancement is performed on the dataset.
Image data enhancement means taking measures to obtain more training data while retaining the original image data category labels. Common data expansion methods include image flipping, image rotation, image translation, image cropping and image scaling. The method of data expansion in this paper mainly uses image flipping, image cropping and image scaling. Flipping is to swap the positions of the two sides of the symmetry axis along the symmetry axis. Flipping is divided into horizontal flipping and vertical flipping. Random cropping is to randomly select the position of the image in the video within a certain range. Image scaling means scaling a part of the original image to another scale.
Taking into account the characteristics of motion video, we mainly use horizontal flipping and random cropping to expand the amount of data. Horizontal flipping about the central axis of symmetry exchanges the left and right sides of each frame and doubles the amount of data. Random cropping randomly selects the position of the image in the video within a certain range to generate more activity video segments. Partial processing results are shown in Figs 2 and 3.
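The two augmentation operations described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: a clip is represented as a list of frames, each frame as a list of pixel rows, and the function names `flip_clip` and `random_crop_clip` are hypothetical. The sketch also assumes that one crop offset is shared by all frames of a clip so that the motion stays spatially consistent.

```python
import random


def flip_clip(clip):
    """Mirror every frame of a clip about its vertical centre axis."""
    return [[row[::-1] for row in frame] for frame in clip]


def random_crop_clip(clip, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at one random position from all frames.

    Assumption of this sketch: the same offset is used for every frame,
    so the cropped frames still form a coherent video segment.
    """
    height, width = len(clip[0]), len(clip[0][0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds frame size")
    top = rng.randint(0, height - crop_h)    # randint includes both ends
    left = rng.randint(0, width - crop_w)
    return [
        [row[left:left + crop_w] for row in frame[top:top + crop_h]]
        for frame in clip
    ]


# Tiny two-frame clip with 2 x 3 "images" used only for demonstration.
demo_clip = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
flipped = flip_clip(demo_clip)
cropped = random_crop_clip(demo_clip, 2, 2, random.Random(0))
```

Keeping the original clip alongside its flipped copy is what doubles the amount of data; each additional random crop adds one more segment.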
Image horizontal flipping example.
Image random cropping example.
In computer vision, image normalization is an important preprocessing step. It reduces the effects of affine and geometric changes and accelerates network convergence. For the training data, this paper uses zero-mean normalization for data standardization. After zero-mean normalization, the influence of different illumination is weakened to a certain extent, and the convergence of the network during training is accelerated. The zero-mean normalization is given by Eq. (1), written here in its standard form:

$$x^{*} = \frac{x - \mu}{\sigma} \tag{1}$$

In the formula, $x$ is the original value, $\mu$ is the mean and $\sigma$ is the standard deviation of the data being normalized.
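Zero-mean normalization as discussed above can be sketched in a few lines. The sketch assumes the common z-score form (subtract the mean, divide by the standard deviation); the function name `zero_mean_normalize` and the small constant `eps`, which guards against division by zero for constant images, are assumptions of this example and are not taken from the paper.

```python
import math


def zero_mean_normalize(pixels, eps=1e-8):
    """Standardise a flat list of pixel values to zero mean, unit variance."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n  # population variance
    std = math.sqrt(variance)
    # eps (assumed) keeps the division defined when all pixels are equal.
    return [(p - mean) / (std + eps) for p in pixels]


normalized = zero_mean_normalize([0.0, 128.0, 255.0])
```

Whether the statistics are computed per frame, per video or over the whole training set is a design choice that the text does not specify.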
To facilitate video processing and reduce data redundancy, the interactive motion video is downsampled so that the data become more regular and identically distributed. According to the time period of the action, each motion video is divided into an action start period, an action execution period and an action end period. In the start and end periods of each human interaction video, the interacting individuals are mostly in separate parts of the frame, and different actions are hard to tell apart. For example, at the beginning of both the handshake and the hug videos, the individuals move from the left and right sides toward the middle of the frame, so the motions are barely distinguishable. In the execution period, by contrast, the characteristics of different actions differ considerably, and the feature information of this stage plays an important role in motion recognition. In response, we propose an improved downsampling method based on a Gaussian model.
Gaussian probability density function distribution map.
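As shown in Fig. 4, the Gaussian function describes the normal distribution, and its probability density function is given by Eq. (2):

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) \tag{2}$$

In the formula, $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.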
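One way to turn a Gaussian density into a time-phased sampling rule is sketched below. The exact rule used in the paper is not recoverable from the text, so this is only a plausible interpretation: frames are weighted by a Gaussian centred on the middle of the video, and frames are picked at equally spaced quantiles of the cumulative weight, so that the execution period is sampled more densely than the start and end periods. The function name `gaussian_frame_indices` and the parameter `sigma_ratio` are hypothetical.

```python
import math


def gaussian_frame_indices(num_frames, num_samples, sigma_ratio=0.25):
    """Pick frame indices whose density follows a Gaussian over time.

    mu is placed at the middle frame and sigma = sigma_ratio * num_frames
    (both are assumptions of this sketch, not values from the paper).
    """
    mu = (num_frames - 1) / 2.0
    sigma = max(sigma_ratio * num_frames, 1e-6)
    weights = [
        math.exp(-((i - mu) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_frames)
    ]
    total = sum(weights)

    # Cumulative distribution of the normalised frame weights.
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)

    # Take the frame at each equally spaced quantile of the distribution.
    indices, frame = [], 0
    for j in range(num_samples):
        target = (j + 0.5) / num_samples
        while frame < num_frames - 1 and cdf[frame] < target:
            frame += 1
        if not indices or indices[-1] != frame:  # skip repeated frames
            indices.append(frame)
    return indices


selected = gaussian_frame_indices(num_frames=90, num_samples=15)
```

With equal-interval downsampling the three periods would receive the same number of frames; under this rule the middle third of the video receives more frames than either outer third.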
After the video data is preprocessed, feature extraction of the human interaction video is performed. For the UT dataset, despite a series of data enhancement operations in the early stage, the amount of human interaction data is still limited. In view of this situation, we adopt the method of transfer learning, which can greatly reduce the workload and achieve better training results.
In this paper, a convolutional neural network based on parallel multi-feature fusion is proposed for extracting interactive feature information, and transfer learning is adopted in the feature extraction process [26].
The block diagram of multi-feature fusion network algorithm based on Inception and ResNet parallel connection is shown in Fig. 5. First, we take the preprocessed video sequence as input into the parallel multi-feature network. Then, we import the obtained feature information into the multi-layer perceptron module. Finally, we put the obtained result to the classifier to get the final classification result.
Block diagram of parallel multi-feature fusion network algorithm.
The Inception V3 network [27] is an important breakthrough in the history of convolutional neural networks. This paper uses a pre-trained Inception network to obtain feature information. The most common way to improve network performance is to increase the depth and width of the network, but this produces a huge number of parameters. As the number of layers increases, training becomes more cumbersome and the network easily over-fits the data. The Inception network was introduced to preserve computing performance while expanding the network. As shown in Fig. 6, this network structure clusters the sparse matrix into denser sub-matrices to improve computing performance, and the depth and width of the neural network are modified accordingly. The large convolution kernel is split into small convolution kernels of different sizes such as 1
Inception module.
The Inception structure uses different receptive fields to fuse different scale features to get more useful and reliable information. At the same time, these networks eliminate the fully connected layer in the network output layer. This method greatly reduces the parameters of the entire network and improves the efficiency of the training process. With the rapid development of deep learning technology, Inception network is constantly improving and innovating. For example, Inception V2 network introduces batch normalization layer, and uses two 3
We continue with the feature extraction method based on the residual neural network [28]. As convolutional neural networks become deeper, a series of problems emerge. During training, as the number of layers grows, gradients may vanish or explode, which makes the network difficult to converge. Researchers hope to increase the nonlinear capability of neural networks by adding layers, but in doing so they run into the phenomenon of network degradation.
Network degradation is the drop in performance caused by increasing the number of layers. The residual neural network was proposed to alleviate this problem. Figure 7 shows the basic framework of the deep residual network. Unlike earlier convolutional neural network structures, the residual structure adds a shortcut connection to the feed-forward network. The shortcut passes the input directly to a subsequent layer by skipping one or more layers, which simplifies the learning objective of the network.
Residual learning: a building block.
There is a residual unit that can be expressed in the form of Eqs (3) and (4):
In the formula,
Similarly, when it is to the lth layer, the Eq. (6) can be obtained.
It can be seen that when the number of layers is deeper and deeper, the output value of the network is related to the output of the residual in the previous layer. And the output of the Lth layer is the sum of the output values of the residuals of the previous layer. In backpropagation, calculating of the partial derivative of the loss function
It can be seen from the formula that, during gradient computation, the gradient is not a pure product of terms that may be zero, so the gradient does not vanish. With this improvement, the training results do not deviate much even when the network is deeper. The residual structure thus avoids the vanishing gradient problem caused by increasing the number of layers, so that the network maintains good performance.
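For reference, the residual relations discussed above are commonly written as follows. This is a sketch in the usual notation of residual network analysis, assuming identity shortcut connections and an identity mapping after the addition. The symbols $x_l$ (input of the $l$th unit), $W_l$ (its weights), $F$ (the residual function) and $\varepsilon$ (the loss) are assumptions of this sketch, and the equation numbers of the original are not reproduced.

```latex
% Output of one residual unit with an identity shortcut
x_{l+1} = x_l + F(x_l, W_l)

% Unrolling from layer l to a deeper layer L
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)

% Gradient of the loss with respect to x_l
\frac{\partial \varepsilon}{\partial x_l}
  = \frac{\partial \varepsilon}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
```

The constant term 1 in the last expression carries the gradient of the deeper layer directly back to layer $l$, which is why the gradient is a sum rather than a pure product of per-layer factors.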
Taking the two-person interaction videos in the UT dataset as an example, we use the Inception V3 model pre-trained on the ImageNet dataset. According to the requirements of the model, the image in the activity video is adjusted to an image of size 299
The proposed network is a multi-feature fusion convolutional neural network based on Inception and ResNet. At the feature extraction level, features are extracted with the Inception V3 network and ResNet separately, and the two sets of features are then fused at the fully connected layer. Training continues after fusion, and once the multi-feature network converges, the image features are richer, which is conducive to improving the accuracy of human interaction video classification and recognition. According to the name of the video, we fuse two feature files generated by each video file to form a feature file with a size of 4096
After parallel feature extraction, further training and classification are performed. Repeated experiments show that the activity recognition model reaches its highest accuracy when two fully connected layers are used. Therefore, the fused features are fed into a multi-layer perceptron consisting of two fully connected layers, and the Softmax classifier is selected for the classification of human interactions.
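The fusion and classification stage described above can be sketched as a forward pass. This is an illustration under stated assumptions, not the authors' TensorFlow code: fusion is taken to be concatenation of the two feature vectors, the first fully connected layer uses ReLU, the second produces one score per action class, and Softmax converts the scores into probabilities. All function names and the weight initialisation are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def relu(vector):
    return [max(0.0, v) for v in vector]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights for demonstration only (He-style scaling, assumed)."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(inception_feat, resnet_feat, layers):
    """Concatenate both feature vectors, then run the MLP and Softmax."""
    hidden = list(inception_feat) + list(resnet_feat)  # feature-level fusion
    for weights, bias in layers[:-1]:
        hidden = relu(dense(hidden, weights, bias))
    out_weights, out_bias = layers[-1]
    return softmax(dense(hidden, out_weights, out_bias))


# Small demo sizes; 6 outputs correspond to the six UT action classes.
rng = random.Random(0)
demo_layers = [init_layer(8, 5, rng), init_layer(5, 6, rng)]
probabilities = fuse_and_classify([0.1] * 4, [0.2] * 4, demo_layers)
```

The demo uses 4-dimensional features to stay fast in pure Python; with two equally sized feature vectors the fused input length is simply the sum of the two lengths.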
For a multi-layer perceptron, each layer consists of many neurons. The neurons of the adjacent layers are fully connected. The output of the hidden layer can be expressed as Eq. (8).
Here,
In the training process, the ReLU activation function is used to introduce nonlinearity. Its simple form reduces the amount of computation and alleviates the vanishing gradient problem during network propagation. The ReLU function is shown in Eq. (10).
In addition, the Adam algorithm [29] is selected to adapt the learning rate during network training. As Eq. (11) shows, this adaptive optimization algorithm corrects the bias of its moment estimates and offers high learning efficiency at low complexity. The first order momentum is expressed in
To prevent over-fitting, a dropout layer [30] is introduced into the network to reduce the co-adaptation between neurons. In the last layer of the network, the output of the fully connected layer is fed into the Softmax layer, so that the classifier yields, for each human interaction video, a prediction that can be interpreted as class probabilities.
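The Adam update mentioned above can be sketched for a plain list of parameters. The update rule follows the standard published form of Adam (exponential moving averages of the gradient and its square, each with bias correction); the default hyperparameters are the commonly used ones and are not stated in the paper, and the names `adam_step` and `init_adam_state` are hypothetical.

```python
import math


def init_adam_state(num_params):
    """Step counter plus first- and second-moment estimates, all zero."""
    return {"t": 0, "m": [0.0] * num_params, "v": [0.0] * num_params}


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter list."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialisation.
        m_hat = state["m"][i] / (1.0 - beta1 ** t)
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


# Toy usage: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x = [0.0]
adam_state = init_adam_state(1)
for _ in range(500):
    x = adam_step(x, [2.0 * (x[0] - 3.0)], adam_state, lr=0.1)
```

Because of the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale.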
Record training set as
Results
Experimental platform and experimental data
The experiments are run on a computer with an NVIDIA GTX 1080Ti graphics card. The deep learning framework is TensorFlow and the main programming language is Python; other data analysis and pre-processing stages are implemented in MATLAB. The experiments are conducted on the PyCharm 2018 and MATLAB 2016 platforms. This article uses the UT action dataset, which includes six types of actions: Hand Shaking, Hugging, Kicking, Pointing, Punching, and Pushing. The videos in this dataset were captured in two different scenarios, one in a parking lot and the other on a windy lawn. As shown in Figs 8 and 9, UT set 1 contains actions against a parking lot background, and UT set 2 was filmed on a windy lawn. Across the entire dataset, the videos have different backgrounds, different resolutions and different lighting conditions, all of which make human interaction recognition challenging.
UT set 1 dataset example.
UT set 2 dataset example.
To study the improvement brought by the Gaussian model-based video downsampling method, a comparative experiment is carried out on UT set 1. We pre-process the complete UT videos with equal-interval downsampling and with Gaussian model-based downsampling, and then use the Inception network and ResNet for feature extraction and classification. The experimental results are shown in Table 1.
It can be seen from Table 1 that the Gaussian model-based downsampling method improves the overall accuracy of human interaction recognition. During the experiments, we found that the time-phased Gaussian downsampling algorithm improves the recognition of the punching and pushing interactions, whereas the improvement for other actions is not obvious. A possible reason is a limitation of the videos in this dataset: some action videos are short, so the choice of sampling method makes little difference.
Comparison of recognition accuracy of different sampling methods
To verify the improvement brought by the parallel multi-feature fusion neural network, a comparison experiment is carried out on UT set 1. In the initial experiment, the Inception network is used to extract the feature information of the two-person interaction, and the Softmax classifier classifies the actions directly. The classification results for each type of action are shown in Fig. 10. We then apply the same procedure with the residual neural network to extract and classify the interactive feature information. The experimental results for each type of action are shown in Fig. 11. In both experiments, the videos are downsampled with the Gaussian model-based method.
Classification of interactions based on Inception network.
Then, we use the multi-feature fusion convolutional neural network for comparison. In this experiment, the Inception network and the residual neural network run in parallel and are fused at the feature level. The fused features are fed to a multi-layer perceptron with two fully connected layers for further training, and the Softmax classifier is used for classification. The experimental results are shown in Fig. 12. Table 2 compares the overall accuracy of the single networks and the multi-feature fusion network.
Comparison of recognition accuracy of single network and parallel multi-feature fusion networks
Classification of interactions based on residual neural network.
Classification of interactions based on multi-feature fusion neural network.
The results of our experiments on the UT human interaction dataset are shown in Fig. 13. Fig. 13a shows that the accuracy on the training set and the validation set increases steadily, and Fig. 13b shows that the loss on both sets decreases steadily. In the end, the training and validation curves tend to coincide.
It can be seen from Figs 10–12 that hugging and pointing are recognized best, which indicates that the characteristics of these two actions are distinctive. The recognition accuracy of the punching action is low: it is easily misidentified as pushing or hand shaking because its characteristic information is not distinctive. As shown in Table 2, compared with the single feature extraction networks, the parallel multi-feature fusion network improves the recognition accuracy of the hand shaking and kicking actions. This indicates that the proposed Inception and ResNet parallel multi-feature fusion network can improve the accuracy of human interaction recognition. In addition, to test the credibility of the method, we performed extended experiments on the UCF101 dataset. On the UCF101 human interaction dataset, the Inception network obtained a recognition accuracy of 65.9%, the residual neural network 62.4%, and the parallel multi-feature fusion network 81.8%, which again shows that the proposed method improves the accuracy of human interaction recognition.
For further analysis, the experimental results of this paper are compared with the classification results of other recent methods on the UT dataset. The comparison is shown in Table 3.
Comparison of recognition accuracy of different identification methods
The results of our experiments on the UT human interaction dataset: (a) Accuracy of training results; (b) loss function of training results.
This paper mainly discusses feature extraction for human interaction. Considering the influence of the time period on the amount of feature information, a Gaussian downsampling model that incorporates time-phase features is used in the video downsampling stage. In the action feature extraction stage, we propose a parallel multi-feature fusion network algorithm based on Inception and ResNet, which uses the Inception network and ResNet for image feature extraction. The extracted features are merged in parallel, and training and classification then continue on the merged features. Experiments on the UT dataset show that the parallel multi-feature fusion network improves classification accuracy compared with a single network. Future work will consider more complex datasets for further validation and research.
Author contributions
Conceptualization, Zhong H.X., Ye Q. and Qu C.; methodology, Zhong H.X., Ye Q. and Qu C.; software, Zhong H.X. and Qu C.; validation, Zhong H.X.; formal analysis, Zhong H.X., Ye Q. and Qu C.; resources, Ye Q. and Zhang Y.M.; data curation, Zhong H.X. and Qu C.; writing – original draft preparation, Zhong H.X. and Qu C.; writing – review and editing, Ye Q. and Zhong H.X.; supervision, Ye Q. and Zhang Y.M.; project administration, Ye Q. and Zhang Y.M.; funding acquisition, Ye Q. and Zhang Y.M.
Funding
This research was funded by the Technology Project of Beijing Municipal Education Commission (No. SQKM201810009002), the National Natural Science Foundation of China (No. 61371143 and No. 61806008), and the Ministry of Education Science and Technology Development Center Project (No. 2018A03029).
