Gesture recognition based on multilevel multimodal feature fusion

Abstract

With the development of human-computer interaction, gesture recognition has become a research hotspot. Falling sensor costs and the richer information carried by RGB-D images have prompted a growing body of work on gesture recognition based on RGB-D images. However, current gesture processing methods for RGB-D images still cannot fully utilize the information the images contain. To address this problem, this paper studies feature extraction for RGB-D images and proposes a multimodal and multilevel feature extraction method. By extracting multimodal and multilevel image features and then mapping and splicing them, the method makes better use of RGB-D image information and improves recognition accuracy. Experiments on a self-built gesture database verify the effectiveness and robustness of the proposed method. Compared with several other RGB-D processing methods, the proposed method achieves better results in gesture recognition.

Keywords

Gesture recognition; RGB-D image; multilevel and multimodal fusion; feature extraction

1 Introduction

Traditional human-computer interaction relies mainly on input and output devices such as keyboards, mice and monitors [1]. Because this differs from the way humans communicate with each other, interaction involving complex tasks is difficult. Therefore, to realize human-computer interaction in a more natural and direct way, many scholars use visual methods to capture human body behaviors and actions and conduct corresponding research [2, 3]. As one of the most flexible and complex communicative limbs of the human body, the hand is central to human-computer interaction research on the recognition of posture and behavior. The significance of gesture recognition research is to map the behaviors or actions produced by human hands into machine-understandable commands, so that interaction between human and machine no longer requires complex and inaccurate manipulation of intermediate media; instead, the machine is manipulated directly by the hands. In recent years, gesture manipulation has been widely used in products across various industries. The intuitive experience it offers has given gesture recognition technology broad application prospects in fields such as virtual reality, entertainment and games, and industrial control [4].

With the development of sensors such as Kinect, RGB and depth images can be collected synchronously, which provides new research ideas and methods for achieving more intelligent human-computer interaction [5, 6]. Low-cost RGB-D cameras add scene-to-camera distance information to RGB images. Using depth images together with RGB images captures the color, texture, appearance and geometric information of the target better, and therefore improves recognition performance. When RGB-D data is used for recognition, different methods are usually applied to extract features from the two modes (RGB and depth), for example different handcrafted feature descriptors (SIFT, Textons, Depth Edges, etc.) [7, 8] to match objects in the two modes. However, traditional feature extraction methods have many problems, especially for gestures: the human hand is flexible and changeable and has many degrees of freedom, and its motion speed also affects how a gesture appears. Human gesture changes are therefore complex and uncertain. When recognizing gestures in a real environment, the influence of the environment, gesture size, light intensity, hand movement speed and other factors must be considered. The requirements on accuracy, real-time performance and robustness are high, so gesture recognition still has considerable room for improvement [9, 10].

The remainder of this paper is organized as follows. Sect. 2 reviews the research status at home and abroad. Sect. 3 introduces the methods of collecting and preprocessing samples. Sect. 4 introduces the multilevel multimodal fusion framework and designs a multilevel multimodal network structure. Sect. 5 analyzes the experimental results of gesture recognition, and Sect. 6 presents the conclusions.

2 Related work

At present, the precision of image sensors keeps increasing while the cost of data acquisition keeps decreasing. As a result, gesture recognition based on computer vision has become a research hotspot that has attracted wide attention from scholars [1].

Belgacem et al. [2] proposed a novel Markovian hybrid CRF/HMM system for gesture recognition, together with a novel motion description method called gesture signature for gesture characterization. The model was applied to recognize gestures in videos, achieved good performance, and remained independent of the type of moving object. In contrast to recognition methods based on image features [3], model-based methods first detect the edge contour of the hand in the image, then establish a hand model from that contour, and finally recognize the gesture through the geometric shape of the hand model [4]. Model-based gesture recognition can achieve better recognition results, but it often needs to search and match in a high-dimensional space, which makes real-time gesture recognition difficult [5]. Li et al. built an SAE-PCA gesture recognition network that combines a sparse autoencoder with a CNN. The network extracts RGB-D image features with a neural network and uses an SVM classifier to recognize 24 static letter gestures [6].

Since 2010, classical 3D sensors such as Kinect, Xtion and Leap Motion have appeared in succession, greatly reducing the complexity of 3D human-computer interaction. Binocular and multi-camera systems obtain distance information from multiple cameras and then determine the location of the target through camera calibration. In short, the images obtained by these devices contain depth information in addition to image information [7].

Pigou et al. [8] explored deep architectures for gesture recognition in video and proposed a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. He et al. [9] put forward an image feature set containing three kinds of Haar features and applied it to pedestrian detection and face recognition with good results. However, because the feature set is small, it requires a larger training set, which makes practical application difficult [10]. Since the hand is a flexible, non-rigid part of the human body, there are subtle differences each time a movement is performed, so each movement appears distorted. Zhou et al. [11] therefore proposed extracting palm and finger features separately to improve the robustness of gesture recognition. Bhuyan et al. [12] detected the gesture area based on skin color features, assisted by face detection, and then recognized 10 gesture areas. Sharp et al. [13] took the prime difference of the gesture image as a feature and realized the detection and tracking of gestures. P et al. [14] used a pixel histogram to show the relationship between the number of fingers and distinguished between gestures 1 and 9; the average recognition rate reached about 90%. Escalante et al. [15] detected the outline information of the fingers and judged the category by their number and direction. Lenz et al. [16] proposed a multimodal deep learning method for robotic grasp detection, which uses stacked autoencoders for multimodal feature learning. In addition, introducing a quantum computer system into the traditional neural network can improve the convergence of the network and the stability of the algorithm.
Reference [23] proposed a new quantum watermarking scheme based on the quantum wavelet transform, including scrambling, embedding and extracting procedures, and verified that the method is robust. Reference [24] proposed a quantum image coding scheme that randomly generates a binary key for each pixel of the image. Reference [25] presented secure quantum key distribution based on a special Deutsch-Jozsa algorithm using Greenberger-Horne-Zeilinger states. Studies [26–29] conducted an in-depth study of multipartite quantum correlations.

In summary, although there are many methods for gesture recognition, vision-based gesture recognition still faces many serious problems in practice [17–19]. Most methods recognize gestures from color images acquired by a single camera, or from grayscale and binary images derived from them, and are therefore constrained by environmental illumination, background complexity and human skin color [20, 21]. Extracting target gestures from a complex, uncontrollable background remains difficult. In feature learning methods based on RGB-D images, the relationship between the different modes has not been studied in depth. Most methods either learn features from the color and depth modes separately, or simply treat RGB-D as undifferentiated four-channel data [22]. The main disadvantage of separate learning is that it ignores the relationship between the two modes: the feature learning of one mode is not adjusted by the other [30]. The main disadvantage of simple four-channel learning is that the combination may have no physical significance and may fail to exploit the different characteristics of each mode [31, 32].

3 Gesture sample collection and preprocessing

3.1 Sample collection

To verify the performance of the proposed multimodal fusion method for gesture recognition under different conditions, an RGB-D gesture database is created that accounts for the noise effects of different illumination intensities, angles and gesture sizes [33]. Kinect is used to collect ten kinds of gesture samples from different people, representing the numbers 0–9. For convenience in the experiments, the ten numbers 0–9 are denoted gesture 1–10, as shown in Fig. 1. For each type of gesture, 2000 samples are selected: 1500 as training samples and 500 as test samples. The self-built gesture database therefore contains 40,000 images in total, of which 20,000 are RGB images and 20,000 are the corresponding depth images. Figure 2 shows some of the samples collected in this paper. These samples are collected under different scales, rotation angles and illumination conditions. Figure 2 (d) is a noise-added image, which is used as a sample to verify the robustness of the algorithm. In this section, the Kinect frame rate is 30 FPS and the resolution is 640*480; the hardware is an i5 CPU with 8 GB of memory, running Ubuntu 16.04.

Fig.1

10 RGB-D gesture samples.

Fig.2

Partial sample images collected under different conditions.

3.2 Sample preprocessing

The data obtained by Kinect cannot be used directly as input to computer vision algorithms [34, 35]. Most algorithms use both RGB data and depth data. To combine the RGB image with the depth data correctly, the output of the RGB camera must be aligned with that of the depth camera [36]. In addition, the original depth data is very noisy, and many pixels in depth images may lack accurate depth values because of reflection or surface scattering, for example on human tissue and hair [37]. These missing “holes” need to be restored before use; therefore, Kinect data needs to be recalibrated or filtered [38].

3.2.1 Depth data calibration

Let the three-dimensional coordinates in the depth camera frame be M_d = [X_d, Y_d, Z_d] ^T, the three-dimensional coordinates in the color camera frame be M = [X, Y, Z] ^T, and the projection coordinates in the two-dimensional color image be m = [u, v] ^T. The correspondence between the two coordinate systems is established by a rotation and a translation: $M = R M_{d} + t$ (1)

In Equation (1), R is a rotation matrix and t is a translation vector [39]. According to the principle of pinhole imaging, the three-dimensional camera coordinates and the pixels of the two-dimensional projection are related as follows: $m = KM$ (2)

In Equation (2), K is the camera internal parameter matrix [40].

The coordinates are expressed in homogeneous form and the pinhole model is used to model the color camera. $Z [\begin{matrix} u \\ v \\ 1 \end{matrix}] = [\begin{matrix} α & γ & u_{0} & 0 \\ 0 & β & v_{0} & 0 \\ 0 & 0 & 1 & 0 \end{matrix}] [\begin{matrix} X \\ Y \\ Z \\ 1 \end{matrix}]$ (3)

In Equation (3), α and β are the scale factors of the pixels on the image plane [41]. Since the upper left corner of the pixel coordinate system is the (0,0) point, (u₀, v₀) is the coordinate of the origin of the image coordinate system in the pixel coordinate system and γ is the skewness of the coordinate axis.

The pixels of the depth image are denoted x = [u, v, z] ^T, where (u, v) are the pixel coordinates and z is the depth value. The mapping from x to M_d is known and is denoted M_d = f (x). The rotation and translation between the color three-dimensional coordinates and the depth three-dimensional coordinates are expressed as follows. $[\begin{matrix} X \\ Y \\ Z \\ 1 \end{matrix}] = [\begin{matrix} R & t \\ 0^{T} & 1 \end{matrix}] [\begin{matrix} X_{d} \\ Y_{d} \\ Z_{d} \\ 1 \end{matrix}]$ (4)

From the above analysis, the internal parameters of the camera are determined by the values of α, β, γ, u₀ and v₀, and the external parameters are determined by the values of R and t. Zhang et al. [49] provided a set of measurements of the internal and external parameters. $\left\{ \begin{matrix} α = 528.32 \\ β = 527.03 \\ γ = 0 \\ u_{0} = 320.10 \\ v_{0} = 257.57 \end{matrix} \right.$ (5)

The rotation angle and translation vector are expressed in spherical coordinates as follows.

$\begin{matrix} {[θ_{x}, θ_{y}, θ_{z}, t_{x}, t_{y}, t_{z}]}^{T} \\ = [0.05, - 0.01, 0.02, 25, 2, - 2]^{T} \end{matrix}$ (6)
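The calibration chain of Equations (1)–(4) can be sketched in a few lines of Python with NumPy, using the values of Equations (5) and (6). Several points are assumptions made only for illustration: the angles are treated as radians and composed as R = Rz·Ry·Rx, the translation is treated as millimetres, and the back-projection `back_project` (standing in for f(x)) reuses the color intrinsics because the depth camera intrinsics are not given in the text. All function names are hypothetical.

```python
import numpy as np

# Intrinsic parameters from Equation (5).
ALPHA, BETA, GAMMA, U0, V0 = 528.32, 527.03, 0.0, 320.10, 257.57
K = np.array([[ALPHA, GAMMA, U0],
              [0.0,   BETA,  V0],
              [0.0,   0.0,   1.0]])

# Extrinsic parameters from Equation (6).
THETA = (0.05, -0.01, 0.02)        # assumption: radians
T = np.array([25.0, 2.0, -2.0])    # assumption: millimetres


def rotation_from_angles(tx, ty, tz):
    """Build R = Rz @ Ry @ Rx (assumed composition order)."""
    cx, sx = np.cos(tx), np.sin(tx)
    cy, sy = np.cos(ty), np.sin(ty)
    cz, sz = np.cos(tz), np.sin(tz)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rz @ ry @ rx


def back_project(u, v, z, k_depth):
    """Hypothetical f(x): depth pixel (u, v) with depth z -> 3D point M_d."""
    return z * np.linalg.solve(k_depth, np.array([u, v, 1.0]))


def depth_pixel_to_color_pixel(u, v, z, k_depth=K):
    """Map a depth pixel into color image coordinates."""
    m_d = back_project(u, v, z, k_depth)          # M_d = f(x)
    m = rotation_from_angles(*THETA) @ m_d + T    # Equation (1)
    uvw = K @ m                                   # Equation (3): Z [u, v, 1]^T = K M
    return uvw[:2] / uvw[2]                       # divide by Z to obtain (u, v)
```

For example, `depth_pixel_to_color_pixel(320, 240, 1000.0)` returns the color-image coordinates of the depth pixel (320, 240) at a depth of 1000 units. With a properly calibrated depth camera, its own intrinsic matrix would be passed as `k_depth`.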

3.2.2 Depth data filtering

3.2.2.1. Median filtering. The median filter is mainly used for noise reduction in image processing. Its key property is that it retains the edge information of the image [50]. The unprocessed depth images contain missing values, and these missing values are likely to be similar to the values of their neighbors [51]. The median filter therefore fills the vacancies while preserving the edges. For a pixel (u, v) of the depth image D that has no depth data, with ω the neighborhood of (u, v), the depth value after median filtering is: $D (u, v) = f_{median} \{D (i, j), (i, j) \in ω\}$ (7)
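A minimal sketch of the hole filling in Equation (7) is given below. It assumes that missing depth is encoded as 0 and that only valid neighbors enter the median; neither choice is stated in the text, and the function name and the `radius` parameter are hypothetical.

```python
import numpy as np


def fill_holes_median(depth, radius=1, invalid=0):
    """Fill pixels equal to `invalid` with the median of their valid neighbours."""
    depth = np.asarray(depth, dtype=float)
    filled = depth.copy()
    for v, u in zip(*np.where(depth == invalid)):
        # Neighbourhood omega of (u, v), clipped at the image border.
        window = depth[max(v - radius, 0):v + radius + 1,
                       max(u - radius, 0):u + radius + 1]
        valid = window[window != invalid]
        if valid.size:                 # leave the hole if no valid neighbour exists
            filled[v, u] = np.median(valid)
    return filled
```

Pixels that already carry a depth value are left untouched, so edges between valid regions are not blurred; large holes may need a larger `radius` or repeated passes.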

3.2.2.2. Bilateral filtering. The bilateral filter is based on the Gaussian distribution and is used for noise reduction and image smoothing in image processing [42, 43]. Each filtered pixel is replaced by the weighted average of the neighborhood pixels in the detection window [44]. The weight depends on the spatial distance between the two pixels and, as Equation (8) shows, on the difference between their depth values [45]. This method has been used effectively to handle discontinuities in Kinect images.

Assuming that D (x, y) is a depth image, P_{x,y} is a pixel in the image, W_{x,y} is a weight normalization factor, ω is the neighborhood of pixel (x, y), and G_{σ_s} and G_{σ_r} represent the weights of the spatial range threshold s and the pixel range threshold r in a Gauss function respectively, the depth value after bilateral filtering is: $B (D (x, y)) = \frac{1}{W_{x, y}} \sum_{(x^{'}, y^{'}) \in ω} D (x^{'}, y^{'}) G_{σ_{s}} (∥ P_{x, y} - P_{x^{'}, y^{'}} ∥) G_{σ_{r}} (∥ D_{x, y} - D_{x^{'}, y^{'}} ∥)$ (8)

3.2.2.3. Joint bilateral filtering. Joint bilateral filtering is an improvement of bilateral filtering that takes color information as additional information. Unlike the bilateral filtering formula, the color image I is used in place of the depth image D to calculate the range weight [46]. $B (D (x, y)) = \frac{1}{W_{x, y}} \sum_{(x^{'}, y^{'}) \in ω} D (x^{'}, y^{'}) G_{σ_{s}} (∥ P_{x, y} - P_{x^{'}, y^{'}} ∥) G_{σ_{r}} (∥ I_{x, y} - I_{x^{'}, y^{'}} ∥)$ (9)
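Equations (8) and (9) differ only in the image used for the range weight, so both can be illustrated with one function. The sketch below is a direct, unoptimized implementation under some assumptions: the guide image is single-channel (for example a grayscale version of the color image), the window is square with a hypothetical `radius`, and the default values of `sigma_s` and `sigma_r` are arbitrary.

```python
import numpy as np


def joint_bilateral(depth, guide=None, radius=2, sigma_s=2.0, sigma_r=10.0):
    """Bilateral filter (guide=None, Eq. 8) or joint bilateral filter (Eq. 9)."""
    depth = np.asarray(depth, dtype=float)
    guide = depth if guide is None else np.asarray(guide, dtype=float)
    h, w = depth.shape
    out = np.empty_like(depth)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
            x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial weight G_sigma_s on the pixel distance.
            g_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            # Range weight G_sigma_r on the guide-image difference.
            g_r = np.exp(-((guide[y0:y1, x0:x1] - guide[y, x]) ** 2)
                         / (2 * sigma_r ** 2))
            weights = g_s * g_r
            # Dividing by the weight sum plays the role of 1 / W_{x,y}.
            out[y, x] = np.sum(weights * depth[y0:y1, x0:x1]) / np.sum(weights)
    return out
```

Calling `joint_bilateral(depth)` uses the depth image itself for the range weight, while `joint_bilateral(depth, guide=gray)` uses a registered intensity image `gray` of the same size. The double loop is written for readability, not speed.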

The median filter is mainly used to compensate for missing data, while the bilateral filter is used to smooth adjacent pixels. In depth images, median filtering fills the missing data well, but it has no effect on the fluctuation of depth over time. The bilateral filter, in contrast, smooths depth values, but it only works in the spatial domain [47]. Therefore, defining a comprehensive noise model is an important first step toward effective depth image filtering. As shown in Fig. 3, the gesture samples are processed with bilateral filters. The left image is the original depth image and the right image is the filtered gesture image.

Fig.3

Gesture sample preprocessing based on bilateral filter.

4 Multilevel multimodal fusion framework and network structure design

4.1 Multilevel and multimodal fusion framework

Conventional methods for recognition based on RGB-D images include: 1) learning the features of RGB images and depth images separately; 2) simply processing RGB-D images as four-channel data; 3) extracting features with different networks and fusing them in the last fully connected layer; 4) using only the output features of the last layer for prediction or classification, without considering the features of earlier layers; 5) considering the differences between the features of different modes and fusing them; 6) fusing and classifying the features of different levels [48–50].

These methods share a problem: they either do not fully consider the relationship between different modes or do not take into account the complementarity between features at different levels. To address this, a pair of convolutional neural networks is used in this paper to extract the features of RGB and depth images, and the features extracted at different levels are analyzed. The features extracted at the same level are mapped and stitched to obtain the multimodal fusion feature core of that level. The multimodal fusion features of the different levels are then arranged in time sequence, and the low-level fusion features assist the high-level features to form a more compact fusion feature core. Figure 4 shows the framework of the multimodal multilevel fusion feature extraction method.

Fig.4

Framework of feature extraction method based on multilevel and multimodal fusion.

The SSD (Single Shot MultiBox Detector) network structure contains feature extraction layers at multiple levels [51, 52]. It is precisely because of this feature extraction scheme that SSD runs fast and achieves good recognition accuracy even on low-resolution pictures [53]. Based on this idea, two CNNs [54] with the same structure are used to extract features at different levels of abstraction from the RGB modal data and the Depth modal data respectively, and the features are then fused in appropriate ways to obtain more discriminative and robust RGB-D fusion features.

As shown in Fig. 5, i^rgb ∈ I^rgb represents the input RGB image, i^depth ∈ I^depth represents the input Depth image, and l_True ∈ L_True represents the corresponding label of the image, where I^rgb, I^depth and L_True are the RGB image set, the Depth image set and the label set respectively. $X_{RGB}^{i}$ and $X_{DEPTH}^{i}$ represent the RGB and Depth output features of layer i of the CNN, where i = 1, 2, ..., N and N is the number of layers per CNN. As mentioned above, a CNN feature extraction network with the same structure is used for the RGB mode and the Depth mode. In this way, features are not extracted from any single level in isolation; instead, they are extracted from different levels and combined, fully considering the impact of different modes and different levels on feature extraction, which lays a good foundation for the final gesture recognition.

Fig.5

Feature extraction diagram of the dual-stream convolutional neural network.

4.2 Network structure design and feature mapping splicing

4.2.1 Network architecture

Because of the difference between the two modes, this paper constructs a separate, smaller network for each mode, so that the data of the two modes can be placed in GPU memory at the same time. The input image is resized to 150 * 150. In the RGB mode, the five convolution layers use the following kernels: layer 1, 7*7*3, stride 2, 96 kernels; layer 2, 5*5*96, stride 2, 96 kernels; layer 3, 3*3*96, stride 1, 112 kernels; layer 4, 3*3*112, stride 1, 128 kernels; layer 5, 3*3*128, stride 1, 128 kernels. The two fully connected layers have sizes 1024 and 512, respectively, and the dropout rate of the first fully connected layer is 0.5. For each 150 * 150 image, overlapping 142 * 142 crops are taken for data augmentation. A max pooling layer follows the first, second and fifth convolution layers, and the ReLU nonlinearity is applied to the output of each convolution layer and each fully connected layer. When the CNNs are initialized by training them independently on RGB and depth images, the size of the final fully connected layer equals the number of categories, and its output is fed into the final SoftMax layer. Apart from the size of the kernels in the first convolution layer (an RGB image has three channels, so the first kernel is 7*7*3; a depth image has one channel, so the first kernel is 7*7*1), the same network architecture is used for the RGB mode and the depth mode in this paper.
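As a quick check on the size of the two streams, the number of convolution parameters implied by the kernel sizes above can be computed directly. The sketch assumes one bias term per kernel, which the text does not state; the names are hypothetical.

```python
# (kernel size, input channels, number of kernels, stride) of the RGB stream.
RGB_CONV_LAYERS = [
    (7, 3, 96, 2),
    (5, 96, 96, 2),
    (3, 96, 112, 1),
    (3, 112, 128, 1),
    (3, 128, 128, 1),
]


def conv_param_count(kernel, in_channels, num_kernels, bias=True):
    """Weights of one convolution layer, plus one bias per kernel if requested."""
    weights = kernel * kernel * in_channels * num_kernels
    return weights + (num_kernels if bias else 0)


def stream_conv_params(first_in_channels):
    """Per-layer parameter counts; only the first layer depends on the modality."""
    counts = []
    for index, (kernel, in_channels, num_kernels, _stride) in enumerate(RGB_CONV_LAYERS):
        channels = first_in_channels if index == 0 else in_channels
        counts.append(conv_param_count(kernel, channels, num_kernels))
    return counts


rgb_params = stream_conv_params(3)     # three-channel RGB input
depth_params = stream_conv_params(1)   # one-channel depth input
```

Under this assumption the two streams differ only in the first layer, by 7*7*2*96 = 9408 weights, which is consistent with the statement that the architectures are otherwise identical.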

4.2.2 Feature mapping mosaic

To use the output features of different convolution layers jointly, we must first solve the lack of one-to-one correspondence among the elements of different feature vectors caused by their different dimensions [55, 56]. More formally, different feature vectors generally have different dimensions [57], so they lie in different feature vector spaces. To fully consider the features at different levels of abstraction, all the features of different dimensions must be mapped into a unified feature space, so that feature mosaic can be achieved [58, 59].

The feature mapping model consists of two convolution layers (including ReLU) and a global maximum pooling layer. It transforms the feature input from each abstraction level into a vector of the same size (1*n), where n represents the number of convolution kernels, as shown in Fig. 6. The feature mapping model is designed as follows: the first convolution layer uses n convolution kernels of size 7*7 to capture the spatial extent (width and height) of the feature maps, while the second convolution layer uses n convolution kernels of size 1*1 to capture their depth information. Finally, the global maximum pooling layer computes the maximum of each depth slice. The kernel sizes were chosen empirically: several sizes were tried and the best-performing one was selected. Figure 6 is the structure diagram of the feature mapping module, through which features $f_{i}^{*}$ are converted to mapping features $F_{i}^{*}$. In the figure, conv (k*k) * n in the first layer denotes that the layer has n convolution kernels of size k*k, and likewise for the second layer; ReLU denotes the activation layer with the ReLU nonlinearity.

The mapping operation performed by the feature mapping module in Fig. 6 is expressed as follows: $F_{i}^{*} = G_{i}^{*} (f_{i}^{*}) \quad \text{s.t.} \quad F_{i}^{*} \in X, \ \forall i$ (10)

Fig.6

Feature mapping module.

In Equation (10), $f_{i}^{*}$ represents the output features at different levels of abstraction in different modes, where * denotes one of the two modes (RGB or depth) and i denotes the abstraction level; $G_{i}^{*}$ represents the feature mapping module, and each mode has its own feature mapping module; $F_{i}^{*}$ represents the mapped feature vector; X represents the common feature vector space of the different modes, in which the vectors of different modes have identical dimensions. After the feature mapping is complete, the feature mosaic of the two modes at each abstraction level can be expressed as $F_{i} = [F_{i}^{RGB}; F_{i}^{DEPTH}]$. Splicing the features at each abstraction level in this way yields several fusion features. The multimodal and multilevel fusion feature kernel is then obtained by arranging the fusion features of the different abstraction levels in time sequence.
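The mapping of Equation (10) and the splicing $F_{i} = [F_{i}^{RGB}; F_{i}^{DEPTH}]$ can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the convolutions use zero "same" padding and stride 1 with no bias, the feature-map shapes and n = 8 are hypothetical, and "time sequence" is read here as the order of the levels from low to high.

```python
import numpy as np


def conv2d_same(x, w):
    """Convolve x (C_in, H, W) with w (n, C_in, k, k); zero padding, stride 1."""
    n, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    height, width = x.shape[1:]
    out = np.zeros((n, height, width))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("nc,chw->nhw", w[:, :, dy, dx], patch)
    return out


def feature_mapping(f, w1, w2):
    """G: conv 7*7 -> ReLU -> conv 1*1 -> ReLU -> global max pooling, shape (n,)."""
    hidden = np.maximum(conv2d_same(f, w1), 0.0)
    hidden = np.maximum(conv2d_same(hidden, w2), 0.0)
    return hidden.max(axis=(1, 2))


def fuse_levels(rgb_feats, depth_feats, rgb_modules, depth_modules):
    """Map both modes at every level and splice them: F_i = [F_i^RGB; F_i^DEPTH]."""
    fused = []
    for f_rgb, f_depth, g_rgb, g_depth in zip(rgb_feats, depth_feats,
                                              rgb_modules, depth_modules):
        fused.append(np.concatenate([feature_mapping(f_rgb, *g_rgb),
                                     feature_mapping(f_depth, *g_depth)]))
    return np.stack(fused)      # one row per level, ordered from low to high


rng = np.random.default_rng(0)
n = 8                                                     # hypothetical kernel count
level_shapes = [(96, 17, 17), (112, 8, 8), (128, 4, 4)]   # hypothetical feature maps


def random_module(channels):
    """Untrained stand-in for the weights of one feature mapping module."""
    return (rng.standard_normal((n, channels, 7, 7)) * 0.01,
            rng.standard_normal((n, n, 1, 1)) * 0.1)


rgb_feats = [rng.standard_normal(s) for s in level_shapes]
depth_feats = [rng.standard_normal(s) for s in level_shapes]
rgb_modules = [random_module(s[0]) for s in level_shapes]
depth_modules = [random_module(s[0]) for s in level_shapes]
fused = fuse_levels(rgb_feats, depth_feats, rgb_modules, depth_modules)
```

Although the three levels have different channel counts and spatial sizes, every row of `fused` has length 2n, which is the property the common feature space X is meant to provide.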

5 Analysis of experimental results of gesture recognition

To demonstrate the advantage of the multimodal gesture recognition method in this paper, several deep learning methods for RGB-D gesture recognition are built with ResNet [60, 61] on the self-built gesture database and compared with the proposed method. The methods are designed as follows: 1) a ResNet network with only the depth image as input, as shown in Fig. 7(c); 2) a ResNet network with only the RGB image as input, as shown in Fig. 7(c); 3) a ResNet network that takes RGB-D data as four-channel input, as shown in Fig. 7(a); 4) a network that uses the RGB-D bimodal data and combines the two modes in the last fully connected layer, as shown in Fig. 7(b). Four network structures are shown in Fig. 7, in which the left-most part is the input sample, the large cube represents the RGB image and the small one represents the Depth image; the right-most red cube represents the output of the network. In the middle part of the network structure, the deep blue cube represents the convolution layer, the light blue cube represents the pooling layer, and the green cube represents the fully connected layer; the dotted line represents the RGB or Depth image input, which is optional.

Fig.7

Different processing methods for RGB-D data.

After the above feature extraction is complete, the extracted features are fed into the LSTM network for classification and recognition, and the output of the LSTM network is connected to the Softmax layer, which outputs the classification result. The models of the three different methods in Fig. 7 are constructed, and the final classification accuracy is obtained by running the models on the self-built gesture database. Table 1 compares the accuracy of the different methods on the self-built gesture database. The comparison shows that the network that simply takes RGB-D data as four-channel input is more accurate than the single-mode networks, that fusion in the fully connected layer is more accurate still, and that the proposed method achieves the highest gesture recognition rate.

Table 1

Experimental results of different network structure methods in RGB-D gesture dataset

Methods	Maximum recognition rate (%)	Minimum recognition rate (%)	Accuracy rate (%)
RGB-CNN (Single mode)	87.3	81.2	84.3
Depth-CNN (Single mode)	85.5	80.5	83
CNN (RGB-D As a four-channel data input)	93.5	89.4	91.5
RGB-Depth fusion (Full connection layer fusion)	94.4	91.3	92.9
The method proposed in this paper	97.2	95.6	96.4
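The classification stage described before Table 1, in which the level-ordered fusion features are fed to an LSTM whose output goes to a Softmax layer, can be sketched as a single-layer LSTM forward pass in NumPy. The weights are random and untrained, and the sizes (5 levels, 32-dimensional fused features, 16 hidden units) are hypothetical; only the 10 output classes come from the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_softmax(sequence, W, U, b, W_out, b_out):
    """Run one LSTM layer over `sequence` (steps, features), classify the last state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in sequence:
        gates = W @ x_t + U @ h + b            # stacked pre-activations, (4 * hidden,)
        i, f, g, o = np.split(gates, 4)        # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    logits = W_out @ h + b_out
    exp = np.exp(logits - logits.max())        # numerically stable Softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
levels, feat_dim, hidden, classes = 5, 32, 16, 10     # hypothetical sizes
sequence = rng.standard_normal((levels, feat_dim))    # stand-in for F_1 ... F_N
W = rng.standard_normal((4 * hidden, feat_dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
W_out = rng.standard_normal((classes, hidden)) * 0.1
b_out = np.zeros(classes)

probs = lstm_softmax(sequence, W, U, b, W_out, b_out)
predicted_gesture = int(np.argmax(probs)) + 1         # gestures are numbered 1-10
```

In a trained system the weights would be learned jointly with the feature extractors; the sketch only shows how a sequence of per-level fusion vectors is reduced to one probability per gesture class.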

The recognition accuracy of the different network structures on the self-built gesture database verifies the feasibility and accuracy of the proposed method. However, these results do not show how the proposed method compares with the most advanced methods. Therefore, several current cutting-edge object recognition or gesture recognition methods based on RGB-D images are also evaluated. Their effectiveness and accuracy are compared with those of the proposed method, and the experimental results are shown in Table 2.

Table 2

Experimental results of different network models on self-built gesture data sets

Methods	Maximum recognition rate (%)	Minimum recognition rate (%)	Accuracy rate (%)
Reference [12]	90.2	84.5	87.4
Reference [13]	92.3	83.8	88.05
Reference [11]	95.3	85.2	90.25
Reference [20]	92.1	90.4	91.25
Reference [19]	95.4	91.3	93.35
The method proposed in this paper	97.2	95.6	96.4

In order to understand the quality of the proposed method more intuitively and clearly, the confusion matrix [63] of this method is calculated on the basis of RGB-D gesture database, as shown in Fig. 8.

Fig.8

The confusion matrix of the method proposed in this paper on the self-built gesture database.

As shown in Fig. 8, the diagonal of the confusion matrix represents the recognition accuracy of each category, with values ranging from 0 to 1, where 1 is the highest. The figure shows that some gestures are still misclassified, but the overall accuracy of gesture recognition is quite good.

To verify the generalization performance of the model, the method is compared with several high-performing methods on the American Sign Language (ASL) dataset [64, 65]. The results are shown in Table 3.
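A minimal sketch of how such a confusion matrix can be computed is given below. It assumes integer class labels starting at 0 and row normalization (each row divided by the number of true samples of that class), which is what makes the diagonal a per-class accuracy between 0 and 1; the toy labels are invented for illustration.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def row_normalize(cm):
    """Divide each row by its total so the diagonal is the per-class accuracy."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float),
                     where=totals > 0)


# Toy example with three classes (invented labels, not the paper's data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
per_class_accuracy = np.diag(row_normalize(cm))
overall_accuracy = np.trace(cm) / cm.sum()
```

Off-diagonal entries of the normalized matrix show which gesture a class is most often mistaken for, which is the information read from Fig. 8.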

Table 3

Experimental results of different network models in the ASL gesture dataset

Methods	Maximum recognition rate (%)	Minimum recognition rate (%)	Accuracy rate (%)
Reference [19]	98	92	95.6
Reference [20]	99.9	85.6	95.4
Reference [21]	97.3	89.7	94.4
Reference [22]	98.2	91	94.6
The method proposed in this paper	99	95.4	97.2

At the same time, in order to understand the validity and accuracy of the method in the ASL database, the confusion matrix on the database is calculated, as shown in Fig. 9.

Fig.9

The confusion matrix of the proposed method on the ASL gesture database.

The comparison of the maximum, minimum and average recognition rates in Table 3 shows that the method in this paper is superior. The 24 types of static gesture samples in the ASL database are selected as data samples. Fig. 9 shows that the proposed method achieves a better recognition effect.

6 Conclusion

This paper first introduced the RGB-D gesture image sample database and the preprocessing of the samples, and then designed the multimodal and multilevel fusion gesture recognition framework. On this basis, a convolutional neural network structure with two modes was designed, and features at different levels of abstraction were extracted in each mode. To solve the problem that the features of different modes have different dimensions, a feature mapping model was designed to map the features of the two modes into the same feature space, after which the feature mosaic was completed. The multilevel and multimodal fusion features were obtained by arranging the multimodal fusion features of each level in time sequence. Finally, the features were input into the LSTM network, and the output of the network was connected to the Softmax layer to output the classification and recognition results. The comparison of different RGB-D gesture image processing methods demonstrates that the proposed method has a better processing effect and a higher recognition rate. Compared with other advanced methods, the proposed method also achieves a superior recognition rate and feasibility.

Acknowledgments

This work was supported by grants of National Natural Science Foundation of China (Grant Nos. 51575407, 51505349, 51575338, 51575412, 61733011); the Grants of National Defense Pre-Research Foundation of Wuhan University of Science and Technology (GF201705) and Open Fund of the Key Laboratory for Metallurgical Equipment and Control of Ministry of Education in Wuhan University of Science and Technology (2018B07).

References

1. Cheng W.T., Sun, Li G.F., Jiang G.Z. and Liu H.H., Jointly network: A network based on CNN and RBM for gesture recognition, Neural Computing and Applications 31(Suppl 1) (2019), 309–323.

2. Belgacem, Chatelain and Paque, Gesture sequence recognition with one shot learned CRF/HMM hybrid model, Image and Vision Computing 61 (2017), 12–21.

3. Sun, Li C.Q., Li G.F., Jiang G.Z., Jiang, Liu H.H., Zheng Z.G. and Shu W.N., Gesture recognition based on Kinect and sEMG signal fusion, Mobile Networks and Applications 23(4) (2018), 797–805.

4. Chakraborty B.K., Sarma, Bhuyan M.K. and MacDorman K.F., Review of constraints on vision-based gesture recognition for human-computer interaction, IET Computer Vision 12(1) (2018), 3–15.

5. , Wang J.X. and Ju Z.J., A novel hand gesture recognition based on high-level features, International Journal of Humanoid Robotics 15(2) (2018). DOI: 10.1142/S0219843617500220

6. , Mi, Li G.F. and Ju Z.J., CNN-Based facial expression recognition from annotated RGB-D images for Human–Robot interaction, International Journal of Humanoid Robotics (2019). DOI: 10.1142/S0219843619410020

7. Tan, Sun, Li G.F., Jiang G.Z., Chen D.S. and Liu H.H., Research on gesture recognition of smart data fusion features in the IoT, Neural Computing and Applications (2019). DOI: 10.1007/s00521-019-04023-0

8. Pigou, Van Den Oord and Dieleman, Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video, International Journal of Computer Vision 126(2–4) (2018), 430–439.

9. , Li G.F., Liao Y.J., Sun, Kong J.Y., Jiang G.Z., Jiang, Tao, Xu and Liu H.H., Gesture recognition based on an improved local sparse representation classification algorithm, Cluster Computing (2017). DOI: 10.1007/s10586-017-1237-1

10. J.X., Jiang G.Z., Li G.F. and Sun, Intelligent human-computer interaction based on surface EMG gesture recognition, IEEE Access 7 (2019), 61378–61387.

11. Zhou, Jiang and Lin, A novel finger and hand pose estimation technique for real-time hand gesture recognition, Pattern Recognition 49 (2016), 102–114.

12. Bhuyan M.K., Kumar D.A. and MacDorman K.F., A novel set of features for continuous hand gesture recognition, Journal on Multimodal User Interfaces 8(4) (2014), 333–343.

13. Sharp, Keskin and Robertson, Accurate, robust, and flexible real-time hand tracking, Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (2015), 3633–3642.

14.

Barros

, Maciel-Junior

N.T.

, Fernandes

B.J.T.

, Bezerra

B.L.D.

and Fernandes

S.M.M.

, A dynamic gesture recognition and prediction system using the convexity approach, Computer Vision and Image Understanding 155 (2017), 139–149.

15.

Escalante

H.J.

, Guyon

, Athitsos

, Jangyodsuk

and Wan

, Principal motion components for one-shot gesture recognition, Pattern Analysis and Applications 20(1) (2017), 167–182.

16.

Lenz

, Lee

and Saxena

, Deep learning for detecting robotic grasps, The International Journal of Robotics Research 34 (2015), 705–724.

17.

Fang

Y.F.

, Zhou

D.L.

, Li

and Liu

H.H.

, Interface prostheses with classifier-feedback-based user training, IEEE Transactions on Neural Systems and Rehabilitation Engineering 64(11) (2017), 2575–2583.

18.

Sagayam

K.M.

and Hemanth

D.J.

, Application of Pseudo 2-D hidden Markov model for hand gesture recognition, Advances in Intelligent Systems and Computing 507 (2017), 179–188.

19.

Marin

, Dominio

and Zanuttigh

, Hand gesture recognition with jointly calibrated leap motion and depth sensor, Multimedia Tools and Application 75(22) (2016), 14991–15015.

20.

Zhao

and Du

, Spectral-spatial feature extraction for hypersperctral image classification: A dimension reduction and deep learning approach, IEEE Transactions on Geoscience and Remote Sensing 54(8) (2016), 4544–4554.

21.

Pigou

, VanDenOord

and Dieleman

, Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video, International Journal of Computer Vision 126(2–4) (2018), 430–439.

22.

Wang

, Li

and Ogunbona

, RGB-D-based human motion recognition with deep learning: A survey, Computer Vision and Image Understanding 171 (2018), 118–139.

23.

Heidari

, Naseri

, Gheibi

, Baghfalaki

, Pourarian

M.R.

and Farouk

, A new quantum watermarking based on quantum wavelet transforms, Communications in theoretical Physics 67(6) (2017), 732–742.

24.

Naseri

, Abdolmaleky

, Parandin

, Fatahi

, Farouk

and Nazari

, A new quantum gray-scale image encoding scheme, Communications in Theoretical Physics 69(2) (2018), 215–226.

25.

Nagata

, Nakamura

and Farouk

, Quantum cryptography based on the Deutsch-Jozsa algorithm, International Journal of Theoretical Physics 56(9) (2017), 2887–2897.

26.

Batle

, Ooi

C.R.

, Farouk

, Abutalib

and Abdalla

, Do multipartite correlations speed up adiabatic quantum computation or quantum annealing? Quantum Information Processing 15(8) (2016), 3081–3099.

27.

Batle

, Farouk

, Tarawneh

and Abdalla

, Multipartite quantum correlations among atoms in QED cavities, Frontiers of Physics 13(1) (2018), 130305.

28.

Batle

, Ooi

C.R.

, Farouk

, Alkhambashi

M.S.

and Abdalla

, Global versus local quantum correlations in the Grover search algorithm, Quantum Information Processing 15(2) (2016), 833–849.

29.

Nagata

, Nakamura

, Geurdes

, Batle

, Abdalla

and Farouk

, Creating very true quantum algorithms for quantum energy based computing, International Journal of Theoretical Physics 57(4) (2018), 973–980.

30.

Jiang

, Li

G.F.

, Sun

, Kong

J.Y.

and Tao

, Gesture recognition based on skeletonization algorithm and CNN with ASL database, Multimedia Tools and Applications (2018). DOI: 10.1007/s11042-018-6748-0

31.

G.F.

, Jiang

, Zhou

Y.L.

, Jiang

G.Z.

, Kong

J.Y.

and Manogaran

, Human Lesion Detection Method Based on Image Information and Brain Signal, IEEE Access 7 (2019), 11533–11542. DOI: 10.1109/ACCESS.2019.2891749

32.

Lee

D.L.

and You

W.S.

, Recognition of complex static hand gestures by using the wristband-based contour features, IET Image Processing 12(1) (2018), 80–87.

33.

Sun

, Hu

J.B.

, Li

G.F.

, Jiang

G.Z.

, Xiong

H.G.

, Tao

, Zheng

Z.J.

and Jiang

, Gear Reducer Optimal Design based on Computer Multimedia Simulation, The Journal of Supercomputing (2018). DOI: 10.1007/s11227-018-2255-3

34.

Nyirarugira

, Choi

H.R.

and Kim

, Hand gesture recognition using particle swarm movement, Mathematical Problems in Engineering 1 (2016), 1–8.

35.

Jiang

, Zheng

Z.J.

, Li

G.F.

, Sun

, Kong

J.Y.

, Jiang

G.Z.

, Xiong

H.G.

, Tao

, Xu

, Liu

H.H.

and Ju

Z.J.

, Gesture recognition based on binocular vision, Cluster Computing (2018). DOI: 10.1007/s10586-018-1844-5

36.

Liu

and KehtarnavazWang

, A, Real-time robust vision-based hand gesture recognition using stereo images, Journal of Real-time Image Processing 11(1) (2016), 201–209.

37.

Chen

D.S.

, Li

G.F.

, Sun

, Kong

J.Y.

, Jiang

G.Z.

, Tang

, Ju

Z.J.

, Yu

and Liu

H.H.

, An interactive image segmentation method in hand gesture recognition, Sensors 17(2) (2017), 253. DOI: 10.3390/s17020253

38.

D’Orazio

, Marani

, Reno

and Cicirelli

, Recent trends in gesture recognition: How depth data has improved classical approaches, Image and Vision Computing 52 (2016), 56–72.

39.

, Sun

, Li

G.F.

, Kong

J.Y.

, Jiang

G.Z.

, Jiang

, Tao

, Xu

and Liu

H.H.

, Gesture recognition based on modified adaptive orthogonal matching pursuit algorithm, Cluster Computing 22(Suppl 1) (2019), 503–512.

40.

Obo

, Loo

CK.

, Seera

and Kubota

, Hybrid evolutionary neuro-fuzzy approach based on mutual adaptation for human gesture recognition, Applied Soft Computong 42 (2016), 377–389.

41.

C.C.

, Li

G.F.

, Jiang

G.Z.

, Chen

D.S.

and Liu

H.H.

, Surface EMG data aggregation processing for intelligent prosthetic action recognition, Neural Computing and Applications (2018). DOI: 10.1007/s00521-018-3909-z

42.

Huang

Y.J.

, Yang

X.C.

, Li

Y.F.

, Zhou

D.L.

and Liu

H.H.

, Ultrasound-based sensing models for finger motion classification, IEEE Journal of Biomedical and Health Informatics 22(5) (2018), 1395–1405.

43.

G.F.

, Kong

J.Y.

, Yang

J.T.

, Huang

X.C.

and Hou

, Genetic algorithm and its application research, prospect in mechanical optimization design, Dynamics of Continuous Discrete and Impulsive Systems-Series A-Mathematical Analysis 13 (2006), 1446–1453.

44.

Dinh

D.L.

, Lee

and Kim

T.S.

, Hand number gesture recognition using recognized hand parts in depth images, Multimedia Tools and Applications 75(2) (2016), 1333–1348.

45.

J.B.

, Sun

, Li

G.F.

, Jiang

G.Z.

and Tao

, Probability analysis for grasp planning facing the field of medical robotics, Measurement 141 (2019), 227–234.

46.

Nuzzi

, Pasinetti

, Lancini

, Docchio

and Sansoni

, Deep learning-based hand gesture recognition for collaborative robots, IEEE Instrumentation & Measurement Magazine 22(2) (2019), 44–51.

47.

Hua

, Li

G.F.

, Jiang

, Zhao

H.Y.

and Qi

J.X.

, An optimized selection method of channel numbers and electrode layouts for hand motions recognition, International Journal of Humanoid Robotics (2019). DOI: 10.1142/S0219843619410068

48.

Panella

and Altilio

, A smartphone based application using machine Learning for gesture recognition, IEEE Instrumentation & Measurement Magazine 8(1) (2019), 25–29.

49.

Zhang

and Zhang

, Calibration between depth and color sensors for commodity depth cameras, Computer Vision and Machine Learning with RGB-D Sensors (2014). DOI: 10.1007/978-3-319-08651-4_3

50.

G.F.

, Wu

, Jiang

G.Z.

, Xu

and Liu

H.H.

, Dynamic gesture recognition in the internet of things, IEEE Access 7 (2019), 23713–23724.

51.

Skaria

, Al-Hourani

, Lech

and Evans

R.J.

, Hand-gesture recognition using Two-Antenna doppler radar with deep convolutional neural networks, IEEE Sensors Journal 19(8) (2019), 3041–3048.

52.

G.F.

, Zhang

L.L.

, Sun

and Kong

J.Y

, Towards the sEMG hand: Internet of things sensors and haptic feedback application, Multimedia Tools and Applications (2018). DOI: 10.1007/s11042-018-6293-x

53.

Deng

, Yang

, Tao

, Deng

L.G.

, Liu

D.Q.

, Guan

Z.Q.

, Li

G.F.

, Li

Z.L.

, Yu

S.H.

, Zheng

G.X.

, Li

Z.Y.

and Zhang

, Spatial Frequency Multiplexed Meta-Holography and Meta-Nanoprinting, ACS Nano 13 (2019), 9237–9246.

54.

, Jiang

Z.G.

, Zhang

, Wang

, Yang

Y.H.

and Li

, An integrated MCDM approach considering demands-matching for reverse logistics, Journal of Cleaner Production 208 (2018), 199–210.

55.

G.F.

, Tang

, Sun

, Kong

J.Y.

, Jiang

G.Z.

, Jiang

, Tao

, Xu

and Liu

H.H.

, Hand gesture recognition based on convolution neural network, Cluster Computing (2017). DOI: 10.1007/s10586-017-1435-x

56.

Sagayam

K.M.

and Hemanth

D.J.

, A probabilistic model for state sequence analysis in hidden Markov model for hand gesture recognition, Computational Intelligence 35(1) (2019), 59–81.

57.

Luo

B.W.

, Sun

, Li

G.F.

, Chen

D.S.

and Ju

Z.J.

, Decomposition algorithm for depth image of human health posture based on brain health, Neural Computing and Applications (2019). DOI: 10.1007/s00521-019-04141-9

58.

Byun

S.W.

and Lee

S.P.

, Hand gesture recognition suitable for wearable devices using flexible epidermal tactile sensor array, Journal of Elecrical Engineering & Technology 13(4) (2018), 1731–1738.

59.

Z.J.

, Ji

X.F.

, Li

and Liu

H.H.

, An integrative framework of human hand gesture segmentation for human-robot interaction, IEEE Systems Journal 11(3) (2017), 1326–1336.

60.

M.C.

, Li

G.F.

, Jiang

, Tao

and Chen

D.S.

, Hand medical monitoring system based on machine learning and optimal EMG feature set, Personal and Ubiquitous Computing (2019). DOI: 10.1007/s00779-019-01285-2

61.

, Fang

Y.F.

, Zhou

, Ju

Z.J.

and Liu

H.H.

, Haptics model for human fingertips based on gaussian distribution, Journal of Intelligent & Fuzzy Systems 36(5) (2019), 3945–3955.

62.

G.F.

, Li

J.H.

, Ju

Z.J.

, Sun

and Kong

J.Y.

, A novel feature extraction method for machine learning based on surface electromyography from healthy brain, Neural Computing and Applications (2019). DOI: 10.1007/s00521-019-04147-3

63.

Choi

H.R.

and Kim

, Modified dynamic time warping based on direction similarity for fast gesture recognition, Mathematical Problems in Engineering (2018), 1–9.

64.

J.X.

, Jiang

G.Z.

, Li

G.F.

, Sun

and Tao

, Surface EMG hand gesture recognition system based on PCA and GRNN, Neural Computing and Applications (2019). DOI: 10.1007/s00521-019-04142-8

65.

Jiang

, Li

G.F.

, Sun

, Kong

J.Y.

, Tao

and Chen

D.S.

, Grip Strength Forecast and Rehabilitative Guidance Based on Adaptive Neural Fuzzy Inference System Using sEMG, Personal and Ubiquitous Computing (2019). DOI: 10.1007/s00779-019-01268-3