Human interaction recognition method based on parallel multi-feature fusion network

Abstract

Human activity recognition is a key technology in intelligent video surveillance and an important research direction in computer vision. However, human interaction features are complex, and motion characteristics differ across the time periods of an action. In this paper, a human interaction recognition algorithm based on a parallel multi-feature fusion network is proposed. First, because different time periods of an action provide different amounts of information, an improved time-phased video downsampling method based on a Gaussian model is proposed. Second, the Inception module uses convolution kernels of different scales for feature extraction, which improves network performance while reducing the number of network parameters. The ResNet module mitigates the degradation problem caused by increasing the depth of neural networks and achieves higher classification accuracy. We therefore combine the advantages of the Inception network and ResNet to extract feature information, fuse the extracted features, and continue training on the fused features, which realizes a parallel connection of the multi-feature neural network. Experiments are carried out on the UT dataset. Compared with traditional activity recognition algorithms, the proposed method better recognizes the six kinds of interactive actions, and its accuracy reaches 88.9%.

Keywords

Parallel multi-feature fusion network, Gaussian model downsampling, human interaction recognition

1. Introduction

As a research hotspot in computer vision, human activity recognition has broad application prospects in fields such as human-computer interaction [1, 2, 3, 4, 5, 6, 7], virtual reality [8] and motion analysis [9, 10, 11, 12, 13, 14]. In recent years, researchers have been working to solve the difficult problems in this area, which has promoted the rapid development of the technology. Among these problems, extracting the feature information in the data effectively and quickly remains a major difficulty.

2. Related work

In feature extraction, global features contain more information about the human body, but their descriptors are sensitive to noise and occlusion. Local features such as space-time interest points do not depend on segmenting and tracking the moving human body and are less sensitive to noise and occlusion, but stable space-time interest points are difficult to extract. Algorithms that combine global and local features to exploit the advantages of both are therefore more common. Meng et al. [15] used different pose estimates to train different kinds of action models by detecting the positions of joint points. They also extracted the semantic spatial relationship between people and combined it with appearance features to realize human activity recognition. Wang et al. [16, 17] used the optical flow field to obtain trajectory features in the video sequence. They then extracted the histogram of optical flow directions, the histogram of motion directions, and the motion boundary histogram to obtain a motion descriptor, which improves activity recognition accuracy. Unlike methods based on RGB image data, some algorithms take 3D joint points derived from depth images as input. Vemulapalli et al. [18] used rotations and translations in 3D space to model the geometric relationships between the parts of the body.

With the development of deep neural networks such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, human activity recognition has made greater progress, and its performance has surpassed that of traditional algorithms. Simonyan et al. [19] proposed a two-stream CNN structure in which the spatial stream processes a single frame and the temporal stream processes dense optical flow from multiple consecutive frames. The two streams are trained on the spatial and temporal dimensions respectively, and their results are finally fused for classification. To address the long time span of video sequences, Wang et al. [20] proposed the temporal segment network, which builds on the two-stream method and analyzes the whole video through uniform sparse sampling and video-level supervision. Ng et al. [21] combined a CNN with an LSTM network, whose memory units represent the frame sequence effectively and improve classification performance. Tran et al. [22] proposed the C3D network to extract spatio-temporal features of video and achieved good experimental results. Its 3D convolution captures temporal information well, which inspired many researchers. Chiang et al. [23] proposed a low-complexity 2D hand gesture identification technique based on 3D depth information. The design overcomes the difficulty of separating the complete palm region from a complex background, and it uses time-view motion tracking to identify various human behaviors. Chen et al. [24] proposed the Multi-Fiber network to make convergence as fast as possible. It splits a complex network into lightweight networks (fibers) and introduces a multiplexer module to let information flow between the fibers. Inspired by the classic non-local means method in computer vision, Wang et al. [25] proposed a non-local operation as a generic family of building blocks for capturing long-range dependencies. This method studies the relationship between pixels that are a certain distance apart in an image, as well as the relationship between video frames.

We focus on human interaction recognition. Because activity characteristics differ across the time periods of a video, an improved downsampling method based on a Gaussian model is proposed. This method extracts video keyframes and removes the influence of a large amount of redundant information. Because the feature information is high-dimensional, feature extraction based on a parallel multi-feature fusion network is proposed.

3. Materials and methods

In this paper, different sampling frequencies are used to account for the different motion characteristics of the time periods of a video. We propose an improved Gaussian-model downsampling method that incorporates time-phase features to obtain video keyframes and remove the influence of a large amount of redundant information. To address the complexity of human interaction features, we propose a human interaction recognition method based on a parallel multi-feature fusion network. Through transfer learning, the Inception network and the residual neural network (ResNet) are used to extract features. The extracted preliminary features are combined in parallel, and training continues on the fused features to obtain the interactive action classification results. The overall block diagram is shown in Fig. 1.

Figure 1.

Human interaction recognition of parallel multi-feature fusion network.

3.1 Video image preprocessing

Video images differ in size, resolution and illumination, which affects the accuracy of human activity recognition. To reduce these adverse effects, the video image data must be preprocessed.

In this paper, we use the UT dataset for experiments. To meet the current demand for large amounts of data and to improve the robustness of the trained human motion recognition model, data enhancement is performed on the dataset.

Image data enhancement means obtaining more training data while retaining the category labels of the original images. Common data expansion methods include image flipping, rotation, translation, cropping and scaling. This paper mainly considers image flipping, cropping and scaling. Flipping mirrors the image about a symmetry axis and is divided into horizontal flipping and vertical flipping. Random cropping selects a randomly positioned region of each video frame within a certain range. Image scaling resizes a part of the original image to another scale.

Taking into account the characteristics of motion video, we mainly use horizontal flipping and random cropping to expand the amount of data. Horizontal flipping about the vertical center axis swaps the left and right sides of each frame and doubles the amount of data. Random cropping generates additional activity video segments from different positions in the frame. Partial processing results are shown in Figs 2 and 3.

Figure 2.

Image horizontal flipping example.

Figure 3.

Image random cropping example.
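The two augmentations used here can be sketched as follows. This is a minimal illustration rather than the preprocessing code of this paper: the clip layout `(frames, height, width, channels)`, the crop size and the function names are assumptions made for the example.

```python
import numpy as np

def horizontal_flip(frames):
    """Mirror every frame about its vertical center axis.

    frames: array of shape (T, H, W, C) (assumed layout).
    """
    return frames[:, :, ::-1, :]

def random_crop(frames, crop_h, crop_w, rng):
    """Cut the same randomly placed window out of every frame of a clip."""
    _, h, w, _ = frames.shape
    top = rng.integers(0, h - crop_h + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return frames[:, top:top + crop_h, left:left + crop_w, :]

# Hypothetical 40-frame clip of 120 x 160 RGB images filled with random pixels.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(40, 120, 160, 3), dtype=np.uint8)

flipped = horizontal_flip(clip)             # left and right sides swapped
cropped = random_crop(clip, 100, 140, rng)  # one extra training segment
```

Using one crop window for the whole clip keeps the two interacting people aligned across frames; cropping each frame independently would introduce artificial jitter.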

In computer vision, image normalization is an important preprocessing step. It reduces the effects of affine and geometric changes and accelerates network convergence. For the training data, this paper uses zero-mean normalization for standardization. After zero-mean normalization, the influence of different illumination is weakened to a certain extent, and the convergence of the network during the training phase is accelerated. The zero-mean normalization is given in Eq. (1), where $x$ is a pixel value and $\mu$ and $\sigma$ are the mean and standard deviation of all pixel data in the training dataset. The processed data has a mean of 0 and a standard deviation of 1.

$\displaystyle\textit{norm}=\frac{x-\mu}{\sigma}$ (1)
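As a small worked instance of Eq. (1), the sketch below estimates $\mu$ and $\sigma$ from training pixels and applies the same statistics to the data. The function names and the four-pixel toy array are illustrative assumptions.

```python
import numpy as np

def fit_normalizer(train_pixels):
    """Return the mean and standard deviation of all training pixels."""
    mu = float(train_pixels.mean())
    sigma = float(train_pixels.std())
    return mu, sigma

def normalize(x, mu, sigma):
    """Eq. (1): subtract the training mean and divide by the training std."""
    return (x - mu) / sigma

# Toy "training set" of four pixel values.
train = np.array([[10.0, 20.0], [30.0, 40.0]])
mu, sigma = fit_normalizer(train)
normed = normalize(train, mu, sigma)
```

The statistics are computed once on the training data and reused unchanged for validation and test frames, so that all splits are mapped with the same transform.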

To facilitate video processing and reduce data redundancy, the interactive motion video is downsampled so that the resulting data is more regular and identically distributed. According to the time period of the action, each motion video is divided into an action start period, an action execution period and an action end period. In the start and end periods of each human interaction video, the interacting individuals are mostly in separate parts of the frame, and different actions are hard to distinguish. For example, at the beginning of both the handshaking and the hugging videos, the two individuals move from the left and right sides toward the middle of the frame, so the motions are barely distinguishable. In the execution period, the characteristics of different actions differ considerably, and the feature information of this stage plays an important role in motion recognition. In response, we propose an improved downsampling method based on a Gaussian model.

Figure 4.

Gaussian probability density function distribution map.

The Gaussian distribution, also called the normal distribution, is shown in Fig. 4, and its probability density function is given in Eq. (2).

$\displaystyle f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ (2)

In the formula, $\mu$ is the expected value of the Gaussian distribution and $\sigma$ is its standard deviation, the square root of the variance. The distribution curve shows that the probability mass concentrates around the expected value. Human interaction videos follow a similar rule: the closer a frame is to the action execution stage, the more discriminative it is. In the UT dataset, each action video sample can be divided into an action start period, an action execution period, and an action end period. Analysis shows that the frames of the execution period, in which different actions produce different feature values, are the most valuable for classifying the interaction, and that these frames are concentrated in the middle part of the video. By analogy with the Gaussian model, we propose a video downsampling method based on the Gaussian probability distribution. During downsampling, the middle part of the video is stretched in time, while the beginning and the end are squeezed in time. In other words, frames near the middle are sampled more densely, and the downsampling becomes more aggressive toward the beginning and the end of the video. The sampling interval is selected through repeated experiments, and the video features are more differentiated after Gaussian sampling.
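One way to realize such Gaussian-weighted sampling is sketched below. The exact sampling rule and interval used in the experiments are not specified here, so this is only an illustration under stated assumptions: the Gaussian is centered on the middle frame, its standard deviation is a hypothetical fraction `rel_sigma` of the video length, and frames are picked at evenly spaced quantiles of the resulting cumulative distribution.

```python
import math

def gaussian_frame_indices(num_frames, num_samples, rel_sigma=0.25):
    """Pick frame indices whose density follows a Gaussian centered mid-video.

    rel_sigma is a hypothetical tuning knob: sigma = rel_sigma * num_frames.
    """
    mu = (num_frames - 1) / 2.0
    sigma = rel_sigma * num_frames
    # Unnormalized Gaussian weight of every frame, as in Eq. (2).
    weights = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2))
               for i in range(num_frames)]
    total = sum(weights)
    # Cumulative distribution over frames.
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    # Take the first frame whose CDF reaches each evenly spaced quantile.
    indices, frame = [], 0
    for k in range(num_samples):
        target = (k + 0.5) / num_samples
        while frame < num_frames - 1 and cdf[frame] < target:
            frame += 1
        indices.append(frame)
    return indices

# Hypothetical 120-frame video reduced to the 40 frames used per clip.
indices = gaussian_frame_indices(120, 40)
gaps = [b - a for a, b in zip(indices, indices[1:])]
```

The gaps between consecutive selected frames are small near the middle of the video and large near both ends, which matches the stretching and squeezing described above. A larger `rel_sigma` moves the result toward equal-interval sampling.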

3.2 Parallel multi-feature fusion network algorithm

After the video data is preprocessed, feature extraction of the human interaction video is performed. For the UT dataset, despite a series of data enhancement operations in the early stage, the amount of human interaction data is still limited. In view of this situation, we adopt the method of transfer learning, which can greatly reduce the workload and achieve better training results.

In this paper, a convolutional neural network based on parallel multi-feature fusion is proposed for extracting interactive feature information, and transfer learning is adopted in the feature extraction process [26].

The block diagram of the multi-feature fusion network algorithm based on the parallel connection of Inception and ResNet is shown in Fig. 5. First, we feed the preprocessed video sequence into the parallel multi-feature network. Then, we pass the obtained feature information to the multi-layer perceptron module. Finally, we feed the result to the classifier to obtain the final classification result.

Figure 5.

Block diagram of parallel multi-feature fusion network algorithm.

3.2.1 Parallel multi-feature network

The Inception V3 [27] network is an important breakthrough in the history of convolutional neural networks. This paper uses a pre-trained Inception network to obtain feature information. To improve network performance, researchers most commonly increase the depth and width of the network, but this approach produces a huge number of parameters. As the number of layers increases, training becomes more cumbersome and over-fitting becomes more likely. The Inception network emerged to preserve computational performance while expanding the network. As shown in Fig. 6, this structure clusters sparse matrices into denser sub-matrices to improve computing performance, and it modifies both the depth and the width of the network. A large convolution kernel is replaced by smaller kernels of different sizes, such as 1 $\times$ 1, 3 $\times$ 3, and 5 $\times$ 5.

Figure 6.

Inception module.

The Inception structure uses different receptive fields to fuse features of different scales and obtain more useful and reliable information. At the same time, these networks eliminate the fully connected layer at the network output. This greatly reduces the number of parameters of the entire network and improves training efficiency. With the rapid development of deep learning, the Inception network has been continually improved. For example, the Inception V2 network introduces the batch normalization layer and replaces a 5 $\times$ 5 convolution kernel with two stacked 3 $\times$ 3 kernels, and the added depth increases the nonlinearity of the network. The Inception V3 structure proposes asymmetric convolution, which decomposes large kernels further: to reduce the number of parameters and speed up computation, an N $\times$ N convolution kernel is decomposed into a 1 $\times$ N kernel followed by an N $\times$ 1 kernel. This not only improves the operation speed but also alleviates over-fitting.
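The parameter savings of these two factorizations can be checked with a little arithmetic. The sketch below counts convolution weights only (biases ignored) and assumes, purely for illustration, that every layer has 64 input and 64 output channels.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels

c = 64  # hypothetical channel count, kept equal for every layer

# Inception V2 idea: one 5x5 kernel versus two stacked 3x3 kernels.
full_5x5 = conv_weights(5, 5, c, c)
two_3x3 = 2 * conv_weights(3, 3, c, c)

# Inception V3 idea: one 3x3 kernel versus a 1x3 kernel followed by a 3x1 kernel.
full_3x3 = conv_weights(3, 3, c, c)
asym_3 = conv_weights(1, 3, c, c) + conv_weights(3, 1, c, c)

print(f"5x5: {full_5x5}, two 3x3: {two_3x3}")
print(f"3x3: {full_3x3}, 1x3 + 3x1: {asym_3}")
```

With equal channel counts, the stacked 3 $\times$ 3 pair needs 18/25 of the weights of a 5 $\times$ 5 kernel, and the asymmetric pair needs 2N/N$^2$ of the weights of an N $\times$ N kernel, which is 2/3 for N = 3.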

We also explore feature extraction based on the residual neural network [28]. As convolutional neural networks become deeper, a series of problems emerge. During training, vanishing or exploding gradients may occur as the number of layers grows, which makes the network difficult to converge. Researchers hope to increase the nonlinear capability of neural networks by adding layers, but they are hindered by the phenomenon of network degradation.

Network degradation means that performance deteriorates as the number of layers increases. The residual neural network was proposed to alleviate this problem. Figure 7 shows the basic building block of the deep residual network. Unlike earlier convolutional neural network structures, the residual structure adds a shortcut connection to the feed-forward network. The shortcut passes the input directly to a later layer by skipping one or more layers, which simplifies the learning objective of the network.

Figure 7.

Residual learning: a building block.

A residual unit can be expressed in the form of Eqs (3) and (4):

$\displaystyle y_{l}=h(x_{l})+F(x_{l},W_{l}),\quad W_{l}=\{W_{l,k}\mid 1\leqslant k\leqslant K\}$ (3)

$\displaystyle x_{l+1}=f(y_{l})$ (4)

In the formula, $x_{l}$ and $y_{l}$ represent the input and output of the $l$-th residual unit, $F$ is the residual function with weights $W_{l}$, $h(x_{l})$ is the identity mapping, and $f(y_{l})$ is the activation function. When both $h$ and $f$ are identity mappings, that is, $h(x_{l})=x_{l}$ and $f(y_{l})=y_{l}$, Eq. (5) follows.

$\displaystyle x_{l+1}=x_{l}+F(x_{l},W_{l})$ (5)

Applying Eq. (5) recursively from layer $l$ up to any deeper layer $L$ gives Eq. (6).

$\displaystyle x_{L}=x_{l}+\sum_{i=l}^{L-1}{F({x_{i},W_{i}})}$ (6)

It can be seen that, however deep the network is, the output of the $L$-th layer equals the input of any shallower layer $l$ plus the sum of the residual outputs of the layers between them. In backpropagation, the partial derivative of the loss function $\varepsilon$ is computed as in Eq. (7).

$\displaystyle\frac{\partial\varepsilon}{\partial x_{l}}=\frac{\partial\varepsilon}{\partial x_{L}}\frac{\partial x_{L}}{\partial x_{l}}=\frac{\partial\varepsilon}{\partial x_{L}}\left(1+\frac{\partial}{\partial x_{l}}\sum_{i=l}^{L-1}F(x_{i},W_{i})\right)$ (7)

The formula shows that the gradient contains an additive term of 1, so $\partial\varepsilon/\partial x_{L}$ is propagated directly to layer $l$ instead of passing only through a product of per-layer factors that may approach 0. The gradient is therefore unlikely to vanish. With this improvement, training results do not deteriorate much even when the network is deeper. The residual structure thus mitigates the vanishing gradient problem caused by increasing the number of layers, so that the network maintains good performance.
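Eqs (5)–(7) can be checked numerically on a toy example. The sketch below stacks four scalar residual units with identity shortcuts and no activation after the addition, as Eq. (5) assumes. The residual branch $F(x,w)=w\tanh(x)$ and the weight values are hypothetical choices made only for this check.

```python
import math

def residual_branch(x, w):
    """Hypothetical scalar residual function F(x, W)."""
    return w * math.tanh(x)

def forward(x, weights):
    """Apply Eq. (5) once per weight; return x_L and every residual output."""
    residuals = []
    for w in weights:
        r = residual_branch(x, w)
        residuals.append(r)
        x = x + r  # identity shortcut plus residual branch
    return x, residuals

def central_difference(fn, x, h=1e-6):
    """Numerical derivative of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

weights = [0.3, -0.2, 0.5, 0.1]
x_l = 0.7
x_L, residuals = forward(x_l, weights)

# Eq. (6): the deep output is the shallow input plus all residual outputs.
eq6_gap = abs(x_L - (x_l + sum(residuals)))

# Eq. (7): d x_L / d x_l equals 1 plus the derivative of the residual sum.
d_xL = central_difference(lambda x: forward(x, weights)[0], x_l)
d_sum = central_difference(lambda x: sum(forward(x, weights)[1]), x_l)
eq7_gap = abs(d_xL - (1.0 + d_sum))
```

Both gaps are zero up to floating-point error, which illustrates that the shortcut contributes the constant 1 to the derivative regardless of how small the residual branches' own derivatives become.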

Taking the two-person interaction videos in the UT dataset as an example, we use the Inception V3 model pre-trained on the ImageNet dataset. As the model requires, each frame of the activity video is resized to 299 $\times$ 299. The output of the last average pooling layer is used as the preliminary feature extracted by the Inception network. The results are stored as feature files, one per video segment. During the experiment, each action video is cut into 40 frames, so the output size of the network is 2048 $\times$ 40. For the pre-trained ResNet, the input requirements of the model mean that each image is cropped to a uniform 224 $\times$ 224 at the feature extraction stage. Similarly, each video generates a feature file of size 2048 $\times$ 40 extracted by the residual neural network.

The proposed network is a multi-feature fusion convolutional neural network based on Inception and ResNet. At the feature extraction level, features are extracted with the Inception V3 network and ResNet separately. The two sets of features are then fused at the fully connected layer, and training continues on the fused features. After the multi-feature network converges, the image features are richer, which is conducive to improving the accuracy of human interaction video classification and recognition. The two feature files generated for each video are matched by video name and fused into one feature file of size 4096 $\times$ 40 for later training and classification.
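The fusion step amounts to stacking the two per-video feature matrices along the feature axis. The sketch below illustrates the shapes involved; random arrays stand in for the real Inception V3 and ResNet feature files, and the function name is an assumption.

```python
import numpy as np

NUM_FRAMES = 40     # frames kept per video
FEATURE_DIM = 2048  # feature length produced by each pre-trained network

def fuse_features(inception_feat, resnet_feat):
    """Stack two (features x frames) matrices along the feature axis."""
    if inception_feat.shape[1] != resnet_feat.shape[1]:
        raise ValueError("both feature files must cover the same frames")
    return np.concatenate([inception_feat, resnet_feat], axis=0)

# Placeholder features for one video (the real ones come from the two networks).
rng = np.random.default_rng(0)
inception_feat = rng.standard_normal((FEATURE_DIM, NUM_FRAMES))
resnet_feat = rng.standard_normal((FEATURE_DIM, NUM_FRAMES))

fused = fuse_features(inception_feat, resnet_feat)
print(fused.shape)
```

Concatenation keeps both descriptors intact for every frame, so the subsequent fully connected layers are free to learn how to weight the two sources.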

3.2.2 Multilayer perceptron

After parallel feature extraction, further training and classification are performed. Repeated experiments show that the trained activity recognition model has the highest accuracy when two fully connected layers are used. Therefore, the fused features are fed to a multi-layer perceptron consisting of two fully connected layers, and the Softmax classifier is selected to classify the human interaction.

For a multi-layer perceptron, each layer consists of many neurons. The neurons of the adjacent layers are fully connected. The output of the hidden layer can be expressed as Eq. (8).

$\displaystyle h_{j}=f\left(\sum\nolimits_{i=1}^{n}w_{ji}x_{i}+b_{j}\right)$ (8)

Here, $w_{ji}$ and $b_{j}$ represent the weights and bias of the $j$-th hidden neuron, and $x_{i}$ is the $i$-th input. The function $f$ represents the activation function. The output $y$ of the fully connected output layer has components $y_{k}$, computed with weights $w_{kj}$ and bias $b_{k}$ as in Eq. (9).

$\displaystyle y_{k}=f\left(\sum_{j=1}^{m}w_{kj}h_{j}+b_{k}\right)$ (9)

In the training process, the ReLU activation function is used to introduce nonlinearity. Its simple form reduces the amount of computation, and it alleviates the vanishing gradient problem during network propagation. The ReLU function is shown in Eq. (10).

$\displaystyle\text{ReLU}(x)=\textit{Max}(x,0)$ (10)
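A forward pass through two fully connected layers with ReLU, following Eqs (8)–(10), can be sketched as below. The sketch adds one bias per output neuron, which is the usual convention and is assumed here; the tiny layer sizes and weight values are chosen only so that the result can be verified by hand.

```python
import numpy as np

def relu(x):
    """Eq. (10): element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Two fully connected layers with f = ReLU, as in Eqs (8) and (9)."""
    h = relu(w_hidden @ x + b_hidden)  # hidden layer output
    y = relu(w_out @ h + b_out)        # output of the second layer
    return h, y

# Hand-checkable toy example: 2 inputs, 3 hidden neurons, 2 outputs.
x = np.array([1.0, -2.0])
w_hidden = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([0.0, 0.0, 0.5])
w_out = np.array([[2.0, 1.0, 1.0],
                  [-1.0, 0.0, 3.0]])
b_out = np.array([0.5, 0.0])

h, y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
```

The hidden pre-activations are (1, -2, -0.5), so ReLU keeps only the first neuron active. In the network described here the input would be the fused feature vector, and the output would be passed on to the Softmax classifier.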

In the network training of this paper, the Adam algorithm [29] is used to adapt the learning rate. As shown in Eq. (11), Adam applies bias correction to its moment estimates, learns efficiently, and has low computational complexity. Here $m_{t}$ is the first-order moment (momentum) estimate and $\hat{m}_{t}$ is its bias-corrected value; $v_{t}$ is the second-order moment estimate and $\hat{v}_{t}$ is its bias-corrected value. $\beta_{1}$ and $\beta_{2}$ are the exponential decay rates of the two moment estimates. The model parameters at the $t$-th iteration are denoted $W_{t}$, and the gradient of the cost function $J$ with respect to $W$ is $g_{t}=\nabla_{W}J(W_{t})$. $\eta$ is the learning rate, and $\varepsilon$ is a very small value that prevents division by zero.

$\displaystyle\left\{\begin{array}{l}m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\\ \\ v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2}\\ \\ \hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}}\\ \\ \hat{v}_{t}=\frac{v_{t}}{1-\beta_{2}^{t}}\\ \\ W_{t+1}=W_{t}-\frac{\eta}{\sqrt{\hat{v}_{t}}+\varepsilon}\hat{m}_{t}\\ \end{array}\right.$ (11)
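The update rule in Eq. (11) can be written out for a single scalar parameter as follows. This is an illustrative sketch, not the training code of this paper: the quadratic cost $J(W)=(W-3)^{2}$, the learning rate and the number of iterations are hypothetical, and the default decay rates are the values commonly used with Adam.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, following Eq. (11)."""
    m = beta1 * m + (1 - beta1) * grad       # first-order moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-order moment
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def cost_gradient(w):
    """Gradient of the toy cost J(W) = (W - 3)^2."""
    return 2.0 * (w - 3.0)

# The very first step has size close to lr, whatever the gradient magnitude.
w_first, _, _ = adam_step(0.0, cost_gradient(0.0), 0.0, 0.0, t=1, lr=0.01)

# Run the optimizer on the toy cost, starting from W = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, cost_gradient(w), m, v, t, lr=0.01)
```

Because the step is normalized by $\sqrt{\hat{v}_{t}}$, its size depends on the learning rate rather than on the raw gradient scale, which is why the first step moves the parameter by about 0.01 even though the gradient there is -6.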

3.2.3 Classifier

To prevent over-fitting, a dropout layer [30] is introduced into the network to reduce the co-adaptation between neurons. In the last layer of the network, the output of the fully connected layer is fed into the Softmax layer, so that each human interaction video receives a prediction with a probabilistic interpretation.

Denote the training set as $\{(x^{(1)},y^{(1)}),\ldots,(x^{(m)},y^{(m)})\}$ with labels $y^{(i)}\in\{1,2,3,\ldots,k\}$, so that there are $k$ categories in total. For an input $x$, the classifier computes the probability $p(y=j|x)$ of each category $j=1,2,\ldots,k$. As shown in Eq. (12), the output $h_{\theta}(x)$ is a $k$-dimensional vector. Each component gives the probability of one of the $k$ categories, and the category with the largest probability is taken as the classification result of the action video. In the formula, $\theta_{1},\theta_{2},\theta_{3},\ldots,\theta_{k}\in\Re^{n+1}$ are the parameters of the model.

$\displaystyle h_{\theta}(x^{(i)})=\left[\begin{array}{c}p(y^{(i)}=1|x^{(i)};\theta)\\ p(y^{(i)}=2|x^{(i)};\theta)\\ \vdots\\ p(y^{(i)}=k|x^{(i)};\theta)\end{array}\right]=\frac{1}{\sum_{j=1}^{k}e^{\theta_{j}^{T}x^{(i)}}}\left[\begin{array}{c}e^{\theta_{1}^{T}x^{(i)}}\\ e^{\theta_{2}^{T}x^{(i)}}\\ \vdots\\ e^{\theta_{k}^{T}x^{(i)}}\end{array}\right]$ (12)
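Eq. (12) can be evaluated directly as in the sketch below. For brevity the example uses three categories instead of six, and the parameter vectors and input are hypothetical. Subtracting the largest score before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math

def softmax_probs(thetas, x):
    """Eq. (12): probability of each category for one input vector x."""
    scores = [sum(t_i * x_i for t_i, x_i in zip(theta, x)) for theta in thetas]
    shift = max(scores)  # stability shift, cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(thetas, x):
    """Return the category (numbered from 1) with the largest probability."""
    probs = softmax_probs(thetas, x)
    return max(range(len(probs)), key=probs.__getitem__) + 1

# Hypothetical parameters for 3 categories and a 3-dimensional input.
thetas = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
x = [1.0, 2.0, 1.0]

probs = softmax_probs(thetas, x)
label = predict(thetas, x)
```

Here the scores are (1, 2, 1), so the second category receives the largest probability and is returned as the predicted label.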

3.3 Pseudo code of parallel multi-feature fusion network algorithm

Algorithm: Parallel multi-feature fusion network algorithm
Input: $n_{b}$ , video sequence length; $X=\{x^{(n_{1})},\ldots,x^{(n_{b})}\}$ , video sequence; $\varepsilon$ , loss function; $w$ , hidden layer weights; $b$ , hidden layer bias; $\theta$ , model parameters;
Output: $h_{\theta}$ , action video classification results;
1. for num $=$ 1; num $\leqslant$ action type numbers do
2. $X_{1}\leftarrow X=\{x^{(n_{1})},\ldots,x^{(n_{b})}\}$ //Input Inception Net’s video sequence sample
3. $X_{2}\leftarrow X=\{x^{(n_{1})},\ldots,x^{(n_{b})}\}$ //Input ResNet’s video sequence sample
4. for $L=1$ ; $L\leqslant$ layers do
5. for $l=1$ ; $l\leqslant$ number of neurons do
6. $h(x_{l})\leftarrow x_{l}$ ;
7. $f(y_{l})\leftarrow y_{l}$ ; //Update $h(x_{l})$ , $f(y_{l})$ according to Eqs (3)–(5)
8. end for
9. $x_{L}\leftarrow x_{l}$
10. $\varepsilon\leftarrow F(x_{i},W_{i})$ //Update $x_{L}$ , $\varepsilon$ according to Eqs (6) and (7)
11. end for
12. $y\leftarrow b$ , $w$ //Update the output $y$ of the fully connected layer of the neural network according to Eqs (8) and (9)
13. $h_{\theta}\leftarrow X$
14. end for

4. Results

4.1 Experimental platform and experimental data

The experiments are run on a computer with an NVIDIA GTX 1080Ti graphics card. The deep learning framework is TensorFlow and the main programming language is Python. Other data analysis and pre-processing stages are implemented in Matlab. The experiments are conducted on the PyCharm 2018 and Matlab 2016 platforms. This article uses the UT action dataset, which includes six types of actions: Hand Shaking, Hugging, Kicking, Pointing, Punching, and Pushing. The videos were captured in two different scenarios, one in a parking lot and the other on a windy lawn. As shown in Figs 8 and 9, UT set 1 contains actions against the parking lot background, and UT set 2 was filmed on the windy lawn. Across the dataset, the videos have different backgrounds, different resolutions and different lighting conditions, all of which make human interaction recognition more challenging.

Figure 8.

UT set 1 dataset example.

Figure 9.

UT set 2 dataset example.

4.2 Analysis of results

To study the effect of the video downsampling method based on the Gaussian model, a comparative experiment is carried out on UT set 1. The whole UT videos are preprocessed with equal-interval downsampling and with Gaussian model-based downsampling. The Inception network and ResNet are then used for feature extraction and classification. The experimental results are shown in Table 1.

Table 1 shows that the Gaussian model-based downsampling method improves the overall accuracy of human interaction recognition. During the experiments, the time-phased Gaussian downsampling algorithm improved the recognition of the punching and pushing interactions, while the improvement for the other actions was not obvious. A possible reason is a limitation of the videos in this dataset: some action videos are short, so the choice of sampling method makes little difference for them.

Table 1

Comparison of recognition accuracy of different sampling methods

Sampling method	Inception network recognition accuracy (%)	ResNet recognition accuracy (%)
Equal interval sampling	72.2	75
Downsampling based on Gaussian model	77.8	83.3

To verify the effect of the parallel multi-feature fusion neural network, a comparison experiment is carried out on the UT dataset, starting with UT set 1. In the initial experiment, the Inception network is used to extract the feature information of the two-person interaction, and the Softmax classifier classifies the action directly. The classification results for each type of action are shown in Fig. 10. The residual neural network is then used in the same way to extract and classify the interactive feature information, and the results for each type of action are shown in Fig. 11. In both cases, the videos are downsampled with the Gaussian model-based method.

Figure 10.

Classification of interactions based on Inception network.

Then, we use the multi-feature fusion convolutional neural network for comparison. In this experiment, the parallel Inception network and residual neural network are fused at the feature level and connected to a multi-layer perceptron with two fully connected layers for further training. Finally, the Softmax classifier is used for classification. The experimental results are shown in Fig. 12. Table 2 compares the overall accuracy of the single networks and the multi-feature fusion network.

Table 2

Comparison of recognition accuracy of single network and parallel multi-feature fusion networks

Feature extraction method	UT set 1 recognition accuracy (%)	UT set 2 recognition accuracy (%)	UCF101 interactive recognition accuracy (%)
Inception network	77.8	72.2	65.9
Residual neural network	83.3	80.5	62.4
Parallel multi-feature fusion network	88.9	83.3	81.8

Figure 11.

Classification of interactions based on residual neural network.

Figure 12.

Classification of interactions based on multi-feature fusion neural network.

The results of our experiments on the UT human interaction dataset are shown in Fig. 13. It can be seen from Fig. 13a that the accuracy on the training set and the validation set increases steadily. At the same time, Fig. 13b shows that the losses on the training set and the validation set both decrease steadily. In the end, the two curves tend to coincide.

It can be seen from Figs 10–12 that hugging and pointing are recognized best, which indicates that the characteristics of these two actions are distinctive. The recognition accuracy of the punching action is low: its characteristic information is not distinctive, and it is easily misidentified as pushing or handshaking. As shown in Table 2, the parallel multi-feature fusion network achieves higher overall accuracy than either single feature extraction network, and the recognition accuracy of the handshaking and kicking actions in particular is improved. This indicates that the proposed Inception and ResNet parallel multi-feature fusion network algorithm can improve the accuracy of human interaction recognition. In addition, to further support the credibility of this method, we performed extended experiments on the UCF101 dataset. On the UCF101 human interaction data, the Inception network obtained a recognition accuracy of 65.9%, the residual neural network 62.4%, and the parallel multi-feature fusion network 81.8%, which again shows that the proposed method improves the accuracy of human interaction recognition.

For further analysis, the experimental results of this paper are compared with the classification results that other methods have reported on the UT dataset in recent years. The comparison is shown in Table 3.

Table 3

Comparison of recognition accuracy of different identification methods

Recognition methods	UT set 1 recognition accuracy (%)	UT set 2 recognition accuracy (%)
HIS color space model [30]	81.7	–
Interactive phrases [31]	88.33	–
Artificial neural network [32]	83.5	72.5
Staged visual co-occurrence matrix sequence [33]	86	–
Method of this paper	88.9	83.3

Figure 13.

The results of our experiments on the UT human interaction dataset: (a) Accuracy of training results; (b) loss function of training results.

5. Discussion

This paper mainly discusses feature extraction for human interaction recognition. Because different time periods of an action carry different amounts of feature information, a Gaussian downsampling model that incorporates temporal-phase features is used in the video downsampling stage. In the feature extraction stage, we propose a parallel multi-feature fusion network based on Inception and ResNet. The Inception network and ResNet extract image features in parallel, the extracted features are merged, and training then continues on the merged features before classification. Experiments on the UT dataset show that the parallel multi-feature fusion network improves classification accuracy compared with a single network. Subsequent work will consider more complex datasets for further validation and research.
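The idea behind Gaussian-weighted temporal downsampling can be illustrated with a minimal Python sketch. It is not the exact sampling rule of this paper: the function name `gaussian_frame_indices`, the parameters `center` and `spread`, and the choice to place samples at equally spaced quantiles of a Gaussian over normalized time are all assumptions made for illustration. The sketch only shows how more frames can be drawn from the phase of the action assumed to be most informative.

```python
from statistics import NormalDist


def gaussian_frame_indices(num_frames, num_samples, center=0.5, spread=0.2):
    """Choose frame indices that are denser around `center`.

    `center` and `spread` are the mean and standard deviation of a Gaussian
    over normalized time in [0, 1]; both are hypothetical parameters.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    dist = NormalDist(mu=center, sigma=spread)
    # Only use the probability mass that falls inside the clip, i.e. [0, 1].
    low, high = dist.cdf(0.0), dist.cdf(1.0)
    indices = []
    for k in range(num_samples):
        # Equally spaced quantiles map to time points that crowd near `center`.
        q = low + (high - low) * (k + 0.5) / num_samples
        t = dist.inv_cdf(q)
        index = min(num_frames - 1, max(0, int(t * num_frames)))
        indices.append(index)
    return indices


indices = gaussian_frame_indices(num_frames=90, num_samples=15)
print(indices)
```

With 90 frames and 15 samples, most of the selected indices fall in the middle third of the clip. When `num_samples` is large relative to `num_frames`, neighbouring quantiles can map to the same frame, so a real pipeline would need to decide how to handle repeated indices.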

Author contributions

Conceptualization, Zhong H.X., Ye Q. and Qu C.; methodology, Zhong H.X., Ye Q. and Qu C.; software, Zhong H.X. and Qu C.; validation, Zhong H.X.; formal analysis, Zhong H.X., Ye Q. and Qu C.; resources, Ye Q. and Zhang Y.M.; data curation, Zhong H.X. and Qu C.; writing – original draft preparation, Zhong H.X. and Qu C.; writing – review and editing, Ye Q. and Zhong H.X.; supervision, Ye Q. and Zhang Y.M.; project administration, Ye Q. and Zhang Y.M.; funding acquisition, Ye Q. and Zhang Y.M.

Funding

This research was funded by the Technology Project of Beijing Municipal Education Commission (No. SQKM201810009002), the National Natural Science Foundation of China (Nos. 61371143 and 61806008), and the Ministry of Education Science and Technology Development Center Project (No. 2018A03029).

References

1. J.X., Jiang G.Z., G.F., Sun and Tao, Intelligent human-computer interaction based on surface EMG gesture recognition, IEEE Access 7 (2019), 61378–61387.
2. Chiang M.L., Feng J.K., Zeng W.L., Fang C.Y. and Chen S.W., A Vision-Based Human Action Recognition System for Companion Robots and Human Interaction, in: 2018 IEEE 4th International Conference on Computer and Communications (ICCC), China, 2018, pp. 1445–1452.
3. Deng, Pang G.Y., Zhang Z.Y., Pang Z.B., Yang H.Y. and Yang, cGAN based facial expression recognition for human-robot interaction, IEEE Access 7 (2019), 9848–9859.
4. J.H., Gao H.W., Yang, Jiang Y.Q. et al., A discriminative deep model with feature fusion and temporal attention for human action recognition, IEEE Access 8 (2020), 43243–43255.
5. Y.L., Yang, Shen F.M., Shen H.T. and Zheng W.S., Arbitrary-view human action recognition: a varying-view RGB-D action dataset, IEEE Transactions on Circuits and Systems for Video Technology 1(1) (2020), 99.
6. Chen, Shen Y.Y., Yan, Wang D.H. and Zhu S.Z., Cholesky decomposition-based metric learning for video-based human action recognition, IEEE Access 8 (2020), 36313–36321.
7. Tufek, Yalcin, Altintas, Kalaoglu and Bahadir S.K., Human action recognition using deep learning methods on limited sensory data, IEEE Sensors Journal 20(6) (2020), 3101–3112.
8. Ping J.M., Liu and Weng D.D., Comparison in Depth Perception between Virtual Reality and Augmented Reality Systems, in: 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Osaka, Japan, 2019, pp. 1124–1125.
9. Ahmed M.U., Kim Y.H., Jin, Bashar and Rhee P.K., Two person interaction recognition based on effective hybrid learning, KSII Transactions on Internet and Information Systems 13(2) (2019), 751–770.
10. Chinimilli P.T., Redkar and Sugar, A two-dimensional feature space-based approach for human locomotion recognition, IEEE Sensors Journal 19(11) (2019), 4271–4282.
11. Phyo C.N., Zin T.T. and Tin, Deep learning for recognizing human activities using motions of skeletal joints, IEEE Transactions on Consumer Electronics 65(2) (2019), 243–252.
12. Carreira and Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733.
13. Fang and Lang, Human activity recognition method based on molecular attributes, International Journal of Distributed Sensor Networks 15(4) (2019).
14. Sanzari, Ntouskos and Pirri, Discovery and recognition of motion primitives in human activities, PLOS ONE 14(4) (2019).
15. Meng L.X., Qing L.Y., Yang, Miao, Chen X.L. and Metaxas D.N., Activity recognition based on semantic spatial relation, in: International Conference on Pattern Recognition, 2012, pp. 609–612.
16. Wang, Kläser, Schmid et al., Dense trajectories and action boundary descriptors for action recognition, International Journal of Computer Vision 103(1) (2013), 60–79.
17. Wang and Schmid, Action Recognition with Improved Trajectories, in: IEEE International Conference on Computer Vision, 2013, pp. 3551–3558.
18. Vemulapalli, Arrate and Chellappa, Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
19. Simonyan and Zisserman, Two-stream convolutional networks for action recognition in videos, Neural Information Processing Systems 1(4) (2014), 568–576.
20. Wang, Xiong, Wang et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in: European Conference on Computer Vision, 2016, pp. 20–36.
21. Joe Y.H., Hausknecht, Vijayanarasimhan et al., Beyond Short Snippets: Deep Networks for Video Classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.
22. Tran, Bourdev, Fergus et al., Learning Spatiotemporal Features with 3D Convolutional Networks, in: IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
23. Chiang and Fan C.P., 3D Depth Information Based 2D Low-Complexity Hand Posture and Gesture Recognition Design for Human Computer Interactions, in: 2018 3rd International Conference on Computer and Communication Systems (ICCCS), Nagoya, 2018, pp. 233–238.
24. Chen, Kalantidis, Yan and Feng, Multi-fiber networks for video recognition, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
25. Wang X.L., Girshick, Gupta and He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
26. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the Inception Architecture for Computer Vision, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
27. K.M., Zhang X.Y., Ren S.Q. and Sun, Deep Residual Learning for Image Recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
28. Kingma and Ba, Adam: A Method for Stochastic Optimization, Computer Science, 2014.
29. Srivastava, Hinton, Krizhevsky et al., Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15(1) (2014), 1929–1958.
30. Huang F.F., Cao J.T. and Ji X.F., Two-person interactive motion recognition algorithm based on multi-channel information fusion, Computer Technology and Development 26(3) (2016), 58–62.
31. Kong, Jia Y.D. and Fu, Learning Human Interaction by Interactive Phrases, in: Computer Vision – ECCV 2012 12th European Conference on Computer Vision, Florence, Italy, 2012, pp. 300–313.
32. Mahmood, Jalal and Sidduqi M.A., Robust Spatio-Temporal Features for Human Interaction Recognition Via Artificial Neural Network, in: 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 2018, pp. 218–223.
33. X.F. and Zu X.M., Two-person interactive recognition based on staged visual co-occurrence matrix sequence, Computer Engineering and Design 38(9) (2017), 2498–2503.
