Abstract
With the continuous development of sensor and computer technology, human-computer interaction technology is also improving. Gesture recognition has become a research hotspot in human-computer interaction, sign language recognition, rehabilitation training, and sports medicine. This paper proposes a hand gesture recognition method that extracts time-domain and frequency-domain features from surface electromyography (sEMG) signals by using an improved multi-channel convolutional neural network (IMC-CNN). The 10 most commonly used hand gestures are recognized from the spectral features of the sEMG signals, which serve as the input of the IMC-CNN model. Firstly, a third-order Butterworth low-pass filter and high-pass filter are used to denoise the sEMG signal. Secondly, the effective sEMG signal segments are extracted from the denoised signal. Thirdly, the spectrogram features of the different channels' sEMG signals are merged into a comprehensive improved spectrogram feature, which is used as the input of IMC-CNN to classify the hand gestures. Finally, the recognition accuracy of the IMC-CNN model is compared with that of the three single-channel CNNs derived from it, SVM, LDA, LCNN and EMGNet. All experiments were carried out on the same dataset and the same computer. The experimental results show that the accuracy, sensitivity and specificity of the proposed model reach 97.5%, 97.25% and 96.25% respectively. The proposed method achieves high average recognition accuracy not only on the dataset collected with the MYO bracelet but also on the NinaPro DB5 dataset. Overall, the proposed model has advantages in accuracy and efficiency over the comparison models.
Introduction
Hands play an important role in daily life and work. The number of people with disabilities caused by hand diseases or accidents is increasing, which seriously affects patients' quality of life. With the development of society and medical treatment, people's demands on quality of life are rising, and rehabilitation is expected to be more accurate and effective. To improve the self-care ability of patients with hand disabilities, intelligent prosthetic hands and other medical assistive products have gradually become a research hotspot and have improved the daily life of these patients to a certain extent [1]. Accurate recognition of hand motion intention is one of the key technologies of intelligent prosthetic hands [2]. Therefore, improving the accuracy and reliability of hand movement recognition has important research value.
At present, there are two kinds of research on hand gesture recognition. The first recognizes gestures from video images. Vishwakarma and Grover [3] presented a method of recognizing hand gestures from depth images. This method supported a vision-based gesture recognition driving system, which can recognize gestures at low intensity. Dhingra and Kunz [4] proposed a gesture recognition method that uses an end-to-end trained 3D residual attention network. By stacking multiple attention blocks, they constructed a three-dimensional network that generates different features at each attention block. Their 3D attention-based residual network can be built and extended to a very deep level. Obaid et al. [5] used two deep learning architectures to recognize hand gestures. The first combined a convolutional neural network with a long short-term memory recurrent neural network to form a new deep learning architecture. The second was composed of two parallel convolutional neural networks and a long short-term memory recurrent neural network with RGB-D feedback. Although these methods achieve good recognition results, they are image-based and therefore place high demands on the external environment, such as adequate illumination and an unoccluded hand (for example, a hand not covered by a glove).
The other type of hand gesture recognition is based on various intelligent visual acquisition sensors, such as Microsoft's Kinect and Ultraleap's Leap Motion. Wang et al. [6] combined the RGB image and depth image data of Microsoft's Kinect sensor to recognize hand gestures and estimate the movement of each finger. They realized real-time hand recognition, tracking and motion estimation and guaranteed the robustness of their method. Lee et al. [7] built a real-time gesture recognition system based on a holographic display. They first obtained the depth image from the Microsoft Azure Kinect sensor and then used the CrossInfoNet deep learning model to obtain information about the hand and the hand joints. Finally, they used this information to define and recognize a variety of basic rotation gestures. Wang et al. [8] realized real-time hand tracking by extracting the hand region of the Kinect image and determining the fingertip. Firstly, they used a fast hand convex hull extraction method to obtain the main candidate points, and then determined the final fingertip based on two typical constraints of hand shape features. Finally, they used the fingertip position to realize real-time hand tracking. Li et al. [9] introduced a gesture recognition system based on Leap Motion. They put forward a spatial fuzzy matching (SFM) algorithm that realizes the matching and fusion of spatial information. In addition, they proposed an initial frame correction strategy for dynamic hand recognition, which uses the motion trajectories of the gesture data set to initialize gestures quickly.
Ameur et al. [10] proposed the Chronological Pattern Indexing (CPI) method to encode the gesture time series data collected by the Leap Motion sensor. They first extracted a set of temporal patterns from different optimized projections. Then, they compared the temporal order of the patterns and encoded the whole sequence with the index of the first pattern. They repeated these steps until an effective feature vector was generated to model the temporal dynamics of gestures. This kind of gesture recognition has been widely used, and its real-time performance and portability have been further improved. However, the recognition accuracy of the above methods still depends on the surrounding environment and on the depth data from the sensor. Moreover, these methods are also vision-based and cannot solve the problem of hand occlusion.
Common methods for hand gesture recognition include SVM [11, 12], LDA [13], the hidden Markov model [14, 15], decision trees [16, 17] and artificial neural networks (ANN) [18, 19]. LeCun et al. [20] presented the original convolutional neural network (CNN) model, and CNN has developed rapidly since then. CNN is a kind of deep feed-forward neural network that has attracted wide attention in recent years. It is a supervised machine learning method and is mainly used to solve problems in the field of computer vision, such as machine vision and image recognition. Li et al. [21] proposed a method that uses multiscale convolution kernels to extract the detailed features of ground objects. Dhall et al. [22] used a deep convolutional neural network model to recognize hand gestures from hand images. CNN is also widely used in speech recognition, translation and other fields. Krizhevsky et al. [23] proposed AlexNet, which is composed of five convolutional layers, some of which are followed by max-pooling layers, three fully connected layers and a final 1000-way softmax. More recently, Rahimian et al. [24] proposed a novel deep learning architecture that recognizes hand gestures from sEMG signals by gradually increasing the network's receptive field.
Although some of the above methods achieve high recognition accuracy, they are either inefficient or able to distinguish only a small number of classes. An sEMG recognition method that combines many classes, high recognition accuracy and high efficiency has therefore become an urgent need in real-time sEMG control. To address these limitations, this paper proposes a gesture recognition method based on the IMC-CNN architecture, which uses the spectral features of the sEMG signal as input data. The sEMG data were collected from the flexor carpi radialis, palmaris longus, flexor carpi ulnaris and flexor digitorum superficialis. Firstly, the sEMG signal was denoised and segmented to obtain the effective data of the hand gestures. Secondly, the short-time Fourier transform (STFT) with a sliding Hamming window [25] was used to extract the spectrum features of the sEMG signal. Thirdly, the spectrum features of each channel were reduced to 64 * 1 data by using the spectrum dimension reduction method [26], the four channels of sEMG features were combined into 64 * 4 data, and the 64 * 4 data were converted into a 16 * 16 two-dimensional spectrum feature image. Finally, the 16 * 16 spectrum feature image was used as the input of IMC-CNN for hand gesture recognition. This method has a high recognition rate and strong stability, and is not affected by environmental factors such as light. It is of great significance for prosthetic control and precise remote control.
The main contribution of this paper is a multi-channel CNN model that emphasizes both time-domain and frequency-domain information to improve the accuracy of gesture recognition. Although the proposed method is slightly less efficient in recognition than the traditional SVM and LDA, it is more efficient than the other CNN methods, and its recognition accuracy is much higher than that of SVM and LDA.
This paper consists of six parts: introduction, sEMG signal collection and preprocessing, spectrograms extraction and model construction, experiments, results and conclusion.
SEMG signal collection and preprocessing
SEMG signal collection
Many muscles in the forearm are involved in the bending and extension of the fingers. The sEMG signals of four main forearm muscles were collected in the following experiments: the flexor carpi radialis, palmaris longus, flexor carpi ulnaris and flexor digitorum superficialis. The muscles whose sEMG signals are collected by the bracelet electrode are shown in Fig. 1.

Fig. 1. The muscles whose sEMG signals are collected by the bracelet electrode.
A DTS-series wireless telemetry sEMG system produced by the gforce EMG Bracelet company was used for sEMG signal acquisition in the following experiments. It supports eight sEMG channels and consists of differential dry electrodes. The sampling frequency of each DTS channel is 2000 Hz. The sEMG signals of three men and three women without any forearm injury were collected. The average age of the six subjects was 35. To avoid the influence of muscle fatigue on the experimental results, the subjects rested fully for three days before the experiments and did not carry out any strenuous exercise. All subjects signed a written informed consent form and a privacy agreement before participating in the experiment, and all agreed that the collected personal sEMG data could be used free of charge for teaching and medical research. The personal conditions and health status of the six subjects are shown in Table 1.
Table 1. The personal conditions and health status of the six subjects
The purpose of the experiments is to recognize the 10 most commonly used hand gestures. The classes are: clench, palm out, thumb up, thumb down, peace sign, four-finger extension, palm up, palm down, palm left and palm right. The data for each hand gesture are collected from the subject's right forearm while the gesture is held still for 3 seconds. Figure 2 shows the different gestures.

Fig. 2. The 10 hand gestures recognized in this paper.
As mentioned previously, the experiments used eight sEMG channels, each sampled at 2000 Hz. During data acquisition, each gesture was repeated 5 times and held still for 3 seconds each time. To prevent inaccurate data caused by muscle fatigue, the subjects rested for one minute after completing each gesture. In total, 300 groups of effective experimental data were collected from the 6 subjects, with each subject performing the 10 gestures five times. Each group of data contained 8 channels of sEMG.
SEMG signal denoising
The sEMG signal is a relatively weak low-frequency signal in the range of 20–500 Hz [27], so the original signal must be filtered and preprocessed. Because the DTS system provides a voltage interference shielding module that removes the 50 Hz power-line interference [28], no separate notch processing is needed and the integrity of the original signal is preserved. The band-pass filter used in the experiments was composed of a cascaded third-order Butterworth low-pass filter and third-order Butterworth high-pass filter [29]. The cut-off frequencies of the low-pass and high-pass filters are 500 Hz and 20 Hz respectively. The attenuation rate of the filter is 18 dB per octave. In this way, the noise outside the effective frequency range is removed. The comparison between the original sEMG signal and the denoised signal is shown in Fig. 3.

Fig. 3. The comparison between the original sEMG signal and the denoised signal.
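The band-pass stage described above can be sketched with SciPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a causal cascade of a third-order high-pass (20 Hz) and a third-order low-pass (500 Hz) Butterworth filter at the stated 2000 Hz sampling rate, and the function name `bandpass_semg` and the test tones are hypothetical.

```python
import numpy as np
from scipy import signal

FS = 2000  # sampling rate in Hz, as stated in the text


def bandpass_semg(x, fs=FS, low_cut=20.0, high_cut=500.0, order=3):
    """Cascade a Butterworth high-pass and low-pass filter (causal filtering)."""
    sos_hp = signal.butter(order, low_cut, btype="highpass", fs=fs, output="sos")
    sos_lp = signal.butter(order, high_cut, btype="lowpass", fs=fs, output="sos")
    # Apply the high-pass first, then the low-pass; the order does not matter
    # for linear time-invariant filters.
    return signal.sosfilt(sos_lp, signal.sosfilt(sos_hp, x))


# Hypothetical test tones: 5 Hz lies below the pass band, 100 Hz lies inside it.
t = np.arange(0, 2.0, 1.0 / FS)
drift_out = bandpass_semg(np.sin(2 * np.pi * 5 * t))
in_band_out = bandpass_semg(np.sin(2 * np.pi * 100 * t))

# Compare steady-state amplitudes (second half of each output).
print(np.max(np.abs(drift_out[FS:])), np.max(np.abs(in_band_out[FS:])))
```

A third-order Butterworth section rolls off at 6 dB per octave per order, which is where the 18 dB per octave figure comes from. Zero-phase filtering (for example `sosfiltfilt`) would double the effective order, so the causal `sosfilt` is used here to stay at third order.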
When different subjects make the same gesture, their muscles often contract in different ways [30]. Even when the same subject makes the same gesture at different times, the contraction of each muscle differs, which leads to differences in the length and amplitude of the sEMG signals collected for the same gesture. During the experiment, the sliding-window energy threshold method is used to segment the sEMG signal automatically, and the invalid, redundant sEMG segments are removed [31]. This avoids the lower recognition rate and the extra data processing that result from invalid data entering the recognition system.
The mean short-term energy E_i of n sampling points is expressed as [32]:

$$E_i = \frac{1}{n}\sum_{j=1}^{n} d_{ij}^{2}$$

where i is the index of the signal segment, j is the index of the sampling point within the ith segment, and d_ij is the value of the jth sampling point of the ith segment.
In the above formula, if n is too large, the energy difference is reduced, which expands the effective data segments; if n is too small, the energy difference is increased, which removes a large amount of effective data. Experiments with n = 40, 50, 80, 100, 150 and 200 showed that, at a sampling frequency of fs = 2000 Hz, n = 100 gives the best result: the mean energy calculated from 50 ms of sEMG data corresponds well with the actual effective EMG signal, and a better segmentation is obtained. The mean short-term energy obtained through the sliding window with n = 100 is shown in Fig. 4.

Fig. 4. The mean short-term energy diagram through the sliding window with n = 100.
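As a rough illustration of the windowed energy computation, the sketch below takes the mean of the squared samples in each window of n = 100 points (50 ms at 2000 Hz). It assumes consecutive, non-overlapping windows, which the text does not state explicitly; the function name and the toy signal are hypothetical.

```python
import numpy as np


def mean_short_term_energy(x, n=100):
    """Mean of squared samples in consecutive, non-overlapping windows of n points."""
    x = np.asarray(x, dtype=float)
    n_windows = len(x) // n                       # drop an incomplete last window
    frames = x[: n_windows * n].reshape(n_windows, n)
    return np.mean(frames ** 2, axis=1)


# Toy signal: 200 samples of rest (zeros) followed by 200 samples of activity.
x = np.concatenate([np.zeros(200), 2.0 * np.ones(200)])
energies = mean_short_term_energy(x, n=100)
print(energies)  # two rest windows followed by two active windows
```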
Firstly, the fixed threshold of the starting point (FTSP) of the gesture interval is calculated. Part of the subjects' sEMG data is segmented manually. For each interval, the ratio of the mean energy at its starting point (Ei-start) to the maximum of the mean energy (Max(E_i)) is calculated. The mean of all these ratios is then multiplied by the maximum of the gesture interval (Ei-max) to give the FTSP of the interval. The FTSP is defined as:
Secondly, the fixed threshold of the ending point (FTEP) of the gesture interval is calculated. For each interval, the ratio of the mean energy at its ending point (Ei-end) to the maximum of the mean energy (Max(E_i)) is calculated. The mean of all these ratios is then multiplied by the maximum of the gesture interval (Ei-max) to give the FTEP of the gesture interval. The FTEP is defined as:
Thirdly, whether the current point is the starting point of the current window's effective data is determined. When E_i > FTSP, the mean of the preceding k mean short-term energies is less than the current window's mean short-term energy, and the mean of the following m mean short-term energies is greater than the current window's mean short-term energy, the ith point is the starting point of the current window's effective data; otherwise it is not. k is defined as:
The quantity range of m is defined as:
Fourthly, whether the current point is the ending point of the current window's effective data is determined. When E_i < FTEP and the mean of the following k mean short-term energies is less than the current window's mean short-term energy, the ith point is the ending point of the current window's effective data; otherwise it is not.
As described above, the fixed thresholds FTSP and FTEP are set to ensure that the effective data segments of each gesture are acquired [33] and that the ineffective data of the rest intervals, as well as the differences arising from gestures and muscles, are eliminated. Segmenting the effective sEMG signal reduces the interference caused by motion differences and environmental factors and provides a sound data basis for the accuracy of the subsequent gesture recognition [34]. The segmentation of the sEMG signal is shown in Fig. 5.

Fig. 5. The segmentation of sEMG signal.
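The start and end rules above can be sketched as a single scan over the window energies. This is only an illustration under stated assumptions: the thresholds `ftsp` and `ftep` are passed in as given numbers, the neighbourhood sizes `k` and `m` are hypothetical defaults (their definitions are not reproduced in this text), and the energy sequence is invented for the example.

```python
import numpy as np


def find_gesture_interval(E, ftsp, ftep, k=2, m=2):
    """Return (start, end) window indices of the first gesture, or None if not found."""
    E = np.asarray(E, dtype=float)
    start = None
    for i in range(k, len(E) - m):
        if start is None:
            # Start rule: above FTSP, rising relative to the k previous windows
            # and still rising over the m following windows.
            rising_in = E[i - k:i].mean() < E[i]
            rising_out = E[i + 1:i + 1 + m].mean() > E[i]
            if E[i] > ftsp and rising_in and rising_out:
                start = i
        elif E[i] < ftep and E[i + 1:i + 1 + k].mean() < E[i]:
            # End rule: below FTEP and still falling over the k following windows.
            return start, i
    return start, None


# Invented mean short-term energies: rest, contraction, rest.
E = [0.1, 0.1, 0.2, 1.0, 3.0, 5.0, 5.0, 4.0, 2.0, 0.8, 0.3, 0.1, 0.1]
interval = find_gesture_interval(E, ftsp=0.5, ftep=1.0)
print(interval)
```

Requiring the neighbouring windows to agree with the threshold crossing is what keeps a single noisy window from opening or closing a segment.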
Extraction of SEMG spectrograms
If a traditional rectangular window is used to extract the sEMG spectrum directly, spectral leakage of the sEMG signal may occur. To prevent this, a sliding Hamming window is used to perform the short-time Fourier transform (STFT) on the sEMG signal. The Hamming window function is defined as:

$$w(n) = (1 - a) - a\cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$$

where N is the window length and a is usually set to 0.46. Because the side lobes of the Hamming window are strongly attenuated, with the peak of the first side lobe almost 40 dB below that of the main lobe, the windowed data are emphasized in the middle and attenuated at both ends. When the window slides by a fraction of its length, such as 1/3 or 1/2 of the window, the data attenuated in the previous window reappear, which makes the sliding relatively stable and minimizes spectral leakage.
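A minimal sketch of the window function, assuming the common generalized Hamming form w(n) = (1 - a) - a·cos(2πn / (N - 1)) with a = 0.46. The helper name `hamming_window` is hypothetical; NumPy's built-in `np.hamming` uses the same coefficients and serves as a cross-check.

```python
import numpy as np


def hamming_window(N, a=0.46):
    """Generalized Hamming window of length N with side coefficient a."""
    n = np.arange(N)
    return (1.0 - a) - a * np.cos(2.0 * np.pi * n / (N - 1))


w = hamming_window(400)          # 400 samples = 200 ms at 2000 Hz
print(w[0], w.max())             # edge value and peak value of the window
print(np.allclose(w, np.hamming(400)))
```

The edge value of 1 - 2a = 0.08 shows that samples at the window ends are strongly attenuated but not zeroed, which is why overlapping windows recover them.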
In order to improve the recognition efficiency and further reduce the delay time of recognition, this paper used STFT to extract the spectrum features of sEMG signal. The formula of STFT function is defined as:
The spectrum of each channel is extracted by applying the STFT with a sliding Hamming window to the sEMG signals of all channels; the horizontal axis of the resulting spectrum is time and the vertical axis is frequency. The spectrogram (SPM) features are then obtained by keeping only the effective frequency band (20–500 Hz) of the sEMG spectrum. The spectrogram parameters are set to fs = 2000, window = 400, noverlap = 200 and nfft = 400.
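The spectrogram step can be sketched with NumPy alone, using the stated parameters (fs = 2000, window = 400, noverlap = 200, nfft = 400). This is an illustrative sketch rather than the authors' code: the power-spectrum scaling, the function names and the 100 Hz test tone are assumptions.

```python
import numpy as np


def spectrogram(x, fs=2000, window=400, noverlap=200, nfft=400):
    """Power spectrogram via a sliding Hamming window; rows = frequency, cols = time."""
    x = np.asarray(x, dtype=float)
    hop = window - noverlap
    win = np.hamming(window)
    n_frames = 1 + (len(x) - window) // hop
    frames = np.stack([x[i * hop:i * hop + window] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
    freqs = np.arange(nfft // 2 + 1) * fs / nfft   # 5 Hz resolution for these settings
    return freqs, power.T


def keep_band(freqs, spec, f_lo=20.0, f_hi=500.0):
    """Keep only the rows inside the effective sEMG band."""
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[mask], spec[mask]


# Hypothetical one-second test tone at 100 Hz.
t = np.arange(0, 1.0, 1.0 / 2000)
freqs, spec = spectrogram(np.sin(2 * np.pi * 100 * t))
band_freqs, band_spec = keep_band(freqs, spec)
peak_freq = band_freqs[np.argmax(band_spec.mean(axis=1))]
print(band_spec.shape, peak_freq)
```

With nfft = 400 at 2000 Hz each frequency bin is 5 Hz wide, so the 20–500 Hz band covers 97 bins, and a 50% overlap yields 9 frames for one second of data.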
Because different gestures involve different muscle contractions, the activity of each channel differs, and not every channel contains enough effective information. The channel data with low correlation are discarded to reduce the redundancy of the spectrum features and to improve recognition accuracy and efficiency. Firstly, the SPM features of each channel are transformed into a 64 * 1 one-dimensional feature. Principal component analysis (PCA) [35] is then used to remove the low-correlation channel features, so as to reduce the frequency-direction dimension of the sEMG data. The eight channels of SPM features correspond to eight principal components. The number of retained principal components is determined from the cumulative variance contribution rate, which reduces the amount of SPM feature data, and the original 64 * 8 SPM features are optimized to obtain the improved SPM features. The variance contribution rate and cumulative variance contribution rate of the SPM features are shown in Table 2 and Fig. 6.
Table 2. Variance and accumulated variance contribution rates of the SPM features

Fig. 6. Variance contribution rate.
The cumulative variance contribution rate of the first four principal components of the SPM features is 99.2%. Therefore, these four principal components are used as the valid features. The improved SPM features retain enough of the effective information in the original eight principal components, reduce the amount of data, and shrink the feature size from 64 * 8 to 64 * 4. The reduced sEMG spectrum represents most of the effective information in the original high-dimensional sEMG spectrum, which makes the data well suited to the CNN classifier, reduces the processing time of the classifier, and improves the recognition efficiency. Taking the clench gesture as an example, the improved SPM features are obtained from a randomly selected segment of sEMG data. The improved SPM feature acquisition process is shown in Fig. 7.

Fig. 7. The improved SPM feature acquisition process.
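The reduction from 64 * 8 to 64 * 4 and the conversion to a 16 * 16 image can be sketched as below. The sketch makes several assumptions that the text does not confirm: PCA is computed by SVD on column-centred data, the 64 * 4 array is reshaped in row-major order, and the input is synthetic rank-4 data generated only for illustration.

```python
import numpy as np


def pca_scores(features):
    """Project column-centred features onto their principal axes.

    Returns the scores (same shape as the input) and the cumulative
    variance contribution rate of the principal components.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = s ** 2 / np.sum(s ** 2)
    return centered @ vt.T, np.cumsum(var_ratio)


# Synthetic 64 x 8 feature matrix whose 8 channels mix only 4 latent sources.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 8))

scores, cumulative = pca_scores(features)
# Smallest number of components whose cumulative contribution reaches 99%.
n_keep = min(int(np.searchsorted(cumulative, 0.99)) + 1, scores.shape[1])

improved = scores[:, :4]            # keep four principal components (64 x 4)
image = improved.reshape(16, 16)    # assumed row-major layout of the 256 values
print(n_keep, image.shape)
```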
IMC-CNN is composed of three parallel channels of convolution and pooling layers, followed by two fully connected layers and a softmax layer. Each channel consists of two convolution layers and two pooling layers; the channels share the same structure but use different convolution kernels. The outputs of the three channels are spliced and then sent to the two fully connected layers and the softmax layer for classification to obtain the final recognition result. The network architecture of IMC-CNN is shown in Fig. 8. In channel 1, the first convolution layer uses 10 third-order convolution kernels of size 2×4×1, and the second convolution layer uses 20 third-order convolution kernels of size 2×4×10. In channel 2, the first convolution layer uses 10 third-order convolution kernels of size 4×2×1, and the second convolution layer uses 20 third-order convolution kernels of size 4×2×10. In channel 3, the first convolution layer uses 10 third-order convolution kernels of size 2×2×1, and the second convolution layer uses 20 third-order convolution kernels of size 2×2×10. To improve the robustness of the method and reduce the influence of local noise on the recognition accuracy, an average pooling layer is added after each convolution layer. In each channel, the stride of the convolution and pooling layers is 1. The output of channel 1 is an 8×4 image, that of channel 2 is 4×8 and that of channel 3 is 8×8. After the images are flattened and spliced, a 128×1 data array is obtained.

Fig. 8. The network architecture of IMC-CNN.
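The size of the spliced feature vector follows directly from the three channel outputs: 8·4 + 4·8 + 8·8 = 128. A small NumPy sketch of the flatten-and-splice step, with a hypothetical helper name and placeholder arrays standing in for the real feature maps:

```python
import numpy as np


def splice_channels(*feature_maps):
    """Flatten each channel output and concatenate them into one vector."""
    return np.concatenate([np.asarray(fm).ravel() for fm in feature_maps])


# Placeholder outputs with the shapes reported for channels 1, 2 and 3.
ch1 = np.zeros((8, 4))
ch2 = np.zeros((4, 8))
ch3 = np.zeros((8, 8))

merged = splice_channels(ch1, ch2, ch3)
print(merged.shape)  # 32 + 32 + 64 values
```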
Each channel in the IMC-CNN architecture focuses on different features. In channel 1, the 2×4 convolution kernel is used to analyze the horizontal, temporal information in the spectrum. In channel 2, the 4×2 convolution kernel is used to analyze the vertical, frequency information. In channel 3, the 2×2 convolution kernel is used to analyze time-domain and frequency-domain information in equal proportion. During feature extraction, the three parallel channels run synchronously and do not affect each other. Finally, the output features of the three channels are combined and passed to the two fully connected layers and the softmax layer.
The fully connected part of the framework consists of two fully connected layers: one with 12 units and one with 10 units, the latter corresponding to the 10 hand gestures to be recognized. In the fully connected layers, the rectified linear unit (ReLU) is used as the activation function, and dropout with a probability of 50% is applied during training to avoid overfitting. The output of the architecture is transformed into the probabilities of the different gestures by the softmax function, which is defined as:

$$P(y = i \mid x) = \frac{\exp\big(h(x, y_i)\big)}{\sum_{j=1}^{10} \exp\big(h(x, y_j)\big)}$$

where P(y = i|x) represents the probability that input x belongs to class i, and h(x, y_i) represents the raw (unnormalized) score that input x belongs to class i.
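To make the classification head concrete, the sketch below passes a 128-element feature vector through a 12-unit ReLU layer and a 10-unit output layer, then applies softmax. The weights are random placeholders, dropout is omitted because it is only active during training, and all names are hypothetical.

```python
import numpy as np


def softmax(h):
    """Numerically stable softmax over a 1-D score vector."""
    h = np.asarray(h, dtype=float)
    z = np.exp(h - np.max(h))      # subtracting the max avoids overflow
    return z / z.sum()


def relu(v):
    return np.maximum(v, 0.0)


def classify(features, W1, b1, W2, b2):
    """128 features -> 12 ReLU units -> 10 class probabilities."""
    hidden = relu(W1 @ features + b1)
    return softmax(W2 @ hidden + b2)


# Random placeholder parameters and input, for shape checking only.
rng = np.random.default_rng(0)
features = rng.normal(size=128)
W1, b1 = rng.normal(scale=0.1, size=(12, 128)), np.zeros(12)
W2, b2 = rng.normal(scale=0.1, size=(10, 12)), np.zeros(10)

probs = classify(features, W1, b1, W2, b2)
print(probs.shape, probs.sum())
```

The predicted gesture is the index of the largest probability, for example `int(np.argmax(probs))`.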
The processing of the whole architecture is shown in Fig. 9. Firstly, the collected sEMG signals are denoised and segmented, and then the SPM features are extracted from all denoised and segmented sEMG signals by using STFT method. PCA is used to reduce the dimension of SPM features, and the reduced features are fused. Finally, the fused features are sent to the IMC-CNN model for classification.

Fig. 9. Processing flow of the architecture.
The root mean square (RMS) [36] of a set of values is the square root of the arithmetic mean of the squares of the values; for a continuous waveform, it is the square root of the mean of the squared function that defines the waveform. It is a commonly used feature that characterizes signal strength [37]. IMC-CNN was used as the classifier to compare the SPM and RMS features in five aspects: preprocessing time, feature extraction time, training time, test time and accuracy. The comparative test was carried out in the same experimental environment with the same data set. The experimental computer had an Intel(R) Xeon(R) E-2276M 2.80 GHz CPU, 16 GB of DDR4 RAM and an NVIDIA Quadro RTX 5000 GPU. The above performance indexes were obtained with 10-fold cross-validation. The comparison of the calculation time and recognition accuracy of SPM and RMS is shown in Table 3.
Table 3. Performance comparison of SPM and RMS features
It can be seen from Table 3 that the SPM features outperform the RMS features in preprocessing time, feature extraction time, test time and, especially, accuracy; only the training time is longer than with RMS. Since training is usually a one-time operation before use, the SPM features are used as the gesture recognition features.
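For reference, the RMS feature used as the baseline here can be computed per channel as in the sketch below. The window length of 400 samples and the array layout (channels by samples) are assumptions made for the example.

```python
import numpy as np


def rms(x, axis=-1):
    """Root mean square: square root of the mean of the squared values."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2, axis=axis))


# One RMS value per channel for a hypothetical 8-channel, 400-sample window.
t = np.arange(400) / 2000.0
window = np.tile(np.sin(2 * np.pi * 100 * t), (8, 1))
channel_rms = rms(window)
print(channel_rms.shape, channel_rms[0])  # a unit sine has RMS close to 1/sqrt(2)
```

RMS collapses each window to a single amplitude value per channel, whereas the spectrogram keeps the frequency distribution of the same window.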
Comparison of different structures of the model
The accuracy, sensitivity and specificity of different structures of the model were compared. Four structures were considered: a single-channel CNN with a 2 * 4 convolution kernel, a single-channel CNN with a 4 * 2 convolution kernel, a single-channel CNN with a 2 * 2 convolution kernel, and the multi-channel CNN with 2 * 4, 4 * 2 and 2 * 2 convolution kernels. The experiments used the 300 groups of sEMG data collected with the MYO bracelet, covering 10 gestures from 6 subjects.
(1) IC1-CNN
In the experiments, the IC1-CNN architecture used the channel 1 with 2 * 4 convolution kernel and the structure of full connection layer and softmax layer behind it in IMC-CNN. The input of IC1-CNN architecture remained unchanged, which is the same as that of the original IMC-CNN. The IC1-CNN architecture is shown in Fig. 10 (a).

Fig. 10. The single channel model of three different convolution kernels of IMC-CNN model.
(2) IC2-CNN
Similar to the structure of IC1-CNN, IC2-CNN architecture used the second channel with 4 * 2 convolution kernel in IMC-CNN and the following full connection layer and softmax layer. The IC2-CNN architecture is shown in Fig. 10 (b).
(3) IC3-CNN
Similar to the structure of IC1-CNN and IC2-CNN, IC3-CNN architecture used the third channel with 2 * 2 convolution kernel in IMC-CNN and the following full connection layer and softmax layer. The IC3-CNN architecture is shown in Fig. 10 (c).
The model was trained on the data of all six subjects by using 10-fold cross-validation. To keep the data of different subjects from being mixed during training, the data of the 6 subjects were divided according to subject and acquisition time. Since each subject performed 5 acquisitions, there were 30 data sets, each containing 10 hand gestures. These were grouped into 10 folds; in each fold, 27 data sets were used as the training set and 3 data sets as the test set. The performance of the model was evaluated by the average of the results of 30 runs. All statistical results were obtained with this 10-fold cross-validation procedure. Training on randomly selected batches avoided the excessive computation and slow gradient descent that occur when all training data are fed to the CNN at once. This random batch selection reduced the computation of each iteration, increased the training speed and found a good gradient descent direction faster; it also helped to avoid overfitting and vanishing gradients. The model training process is shown in Fig. 11.

Fig. 11. The model training process.
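One way to realize the 27/3 split is sketched below: the 30 recording sets (6 subjects × 5 acquisitions) are shuffled once and cut into 10 folds of 3, and each fold serves as the test set in turn. The shuffling, the seed and the function name are assumptions; the text does not specify how the data sets were assigned to folds.

```python
import random


def make_folds(n_subjects=6, n_repeats=5, n_folds=10, seed=0):
    """Yield (train, test) lists of (subject, repeat) pairs for each fold."""
    sets = [(s, r) for s in range(n_subjects) for r in range(n_repeats)]
    random.Random(seed).shuffle(sets)          # fixed seed keeps the split reproducible
    fold_size = len(sets) // n_folds           # 30 // 10 = 3 data sets per fold
    folds = [sets[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for test in folds:
        train = [d for d in sets if d not in test]
        yield train, test


splits = list(make_folds())
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Splitting at the level of whole recording sets, rather than individual windows, keeps windows from the same acquisition out of both the training and the test set.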
During the training of the above four models, categorical cross-entropy [38] was used as the loss function. The Adam algorithm [39] with a learning rate of 0.02 and a batch size of 10 was used for optimization. Training proceeded for up to 50 epochs.
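As a small worked example of the loss, the sketch below computes categorical cross-entropy for one-hot labels over 10 classes. The clipping constant `eps` and the example predictions are assumptions added for numerical safety and illustration.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum(y_true * log(y_pred)) for one-hot y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


# A batch of 10 samples (the stated batch size), one per gesture class.
y_true = np.eye(10)
uniform = np.full((10, 10), 0.1)     # an untrained model guessing uniformly

loss_uniform = categorical_cross_entropy(y_true, uniform)
loss_perfect = categorical_cross_entropy(y_true, y_true)
print(loss_uniform, loss_perfect)    # ln(10) for uniform guesses, about 0 when perfect
```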
The experiments were carried out in the same experimental environment and on the same data sets. The performance of the different recognition structures was analyzed in terms of sensitivity, accuracy and specificity, and a quantitative analysis of the experimental results was made. The hand gesture recognition results were classified into the following four categories.
TP: right gesture samples are identified as right gesture.
FP: wrong gesture samples are identified as right gesture.
TN: wrong gesture samples are identified as wrong gesture.
FN: right gesture samples are identified as wrong gesture.
The following three performance indicators were used to judge the performance of the different structures of the method, and the experimental results are shown in Table 4.
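The three indicators can be computed from the four counts above as sketched here, assuming the standard definitions: accuracy = (TP + TN) / (TP + FP + TN + FN), sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The counts in the example are invented and are not results from this paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from one-vs-rest confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)      # share of right-gesture samples found
    specificity = tn / (tn + fp)      # share of wrong-gesture samples rejected
    return accuracy, sensitivity, specificity


# Invented counts for one gesture class evaluated against all others.
accuracy, sensitivity, specificity = binary_metrics(tp=9, fp=5, tn=85, fn=1)
print(accuracy, sensitivity, specificity)
```

For a 10-class problem, these values would typically be computed per gesture in a one-vs-rest fashion and then averaged.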
Table 4. Performance indicators comparison
The experimental results show that the sEMG gesture recognition method based on IMC-CNN has better performance than other structures in terms of accuracy, sensitivity and specificity.
Firstly, the recognition accuracy of the proposed method was compared with that of classical machine learning methods (LDA and SVM), LCNN and EMGNet on the sEMG data collected with the MYO bracelet. In the experiments, time-domain characteristics were selected as the features for LDA and SVM. For LCNN, no features were extracted and the sEMG data were fed to the network directly. For EMGNet, the sEMG signal was processed by continuous wavelet transform to form the input.
Secondly, the sEMG data from NinaPro DB5 [40] were used as input to compare the average recognition accuracy of the proposed model with that of the other methods mentioned above.
The average recognition accuracy of the above methods on the MYO dataset is shown in Table 5 and Fig. 12.
Table 5. Average accuracy and recognition time of different methods on MYO and NinaPro DB5 datasets

Fig. 12. Comparison of methods accuracy on MYO and NinaPro DB5 datasets.
Compared with the other methods, the proposed IMC-CNN model has better recognition accuracy on both the MYO dataset and the NinaPro DB5 dataset, as shown in Table 5 and Fig. 12. The average recognition accuracy of each method on the NinaPro DB5 dataset is lower than that on the MYO dataset because NinaPro DB5 has a relatively small amount of data for each gesture and more gesture classes: the NinaPro DB5 dataset has 12 gestures, whereas the MYO dataset has 10. Although the recognition time of the proposed method is slightly longer than that of SVM and LDA, its recognition accuracy is much higher than that of these two methods.
To make life more convenient for people with hand disabilities, this paper proposes a hand gesture recognition method that uses sEMG data and is based on the IMC-CNN model. The proposed IMC-CNN model uses different convolution kernels to fuse the time-frequency characteristics of the sEMG signals of 10 different gestures. The experimental results show that the method has high accuracy, sensitivity and specificity, reaching 97.50%, 97.25% and 96.25% respectively. The proposed method can accurately recognize the 10 most commonly used gestures. It achieves high average recognition accuracy not only on the MYO dataset collected by the authors but also on the NinaPro DB5 dataset. The recognition accuracy on the NinaPro DB5 dataset is slightly lower than on the collected MYO dataset because NinaPro DB5 has a relatively small amount of data for each gesture and more gesture classes.
The recognition accuracy of the proposed method for each of the 10 gestures was compared with that of the other methods mentioned above, and the comparison results are shown in Table 6. These results also show that the proposed method has high recognition accuracy.
Table 6. Accuracy comparison of ten kinds of gestures
In future work, the model parameters will be optimized to improve the recognition accuracy, especially on datasets with a relatively small amount of gesture data. More gesture categories will be added and more public sEMG datasets will be tested. In addition, dynamic gesture tracking will be studied further.
The gesture recognition method based on the IMC-CNN architecture has clear advantages in accuracy and efficiency. Because a single channel uses a small convolution kernel and receives little input information, its performance indexes are the lowest: the performance of the IC1-CNN, IC2-CNN and IC3-CNN models is lower than that of IMC-CNN. The experimental results show that IMC-CNN, with its different convolution kernels, effectively improves classification accuracy by combining the time direction, the frequency direction and the time-frequency direction of the spectrum. In addition, the performance of the single-channel CNNs is better than that of the other methods mentioned above in all three aspects on both the collected MYO dataset and the NinaPro DB5 dataset. The IMC-CNN model, which uses the improved spectrum features of the sEMG signal, updates its parameters automatically through iterative training and back propagation, which effectively improves the recognition accuracy. At the same time, training efficiency is improved by randomly selecting batches of training data within the 10-fold cross-validation procedure. The experimental results show that the training time and running time of gesture recognition are acceptable. The recognition accuracy of the proposed IMC-CNN model on the NinaPro DB5 dataset is slightly lower than on the collected MYO dataset because NinaPro DB5 has a relatively small amount of data for each gesture and more gesture classes. Using sEMG signals for gesture recognition is of great significance for the control of prosthetic hands and for dynamic gesture tracking control.
Declarations
Funding
This work was supported in part by the National Key Research and Development Program of China (2017YFB1401200); in part by the Program for Top 100 Innovative Talents in Colleges and Universities of Hebei Province (SLRC2017022); in part by the Hebei Province Postdoctoral Scientific Research Project (B2019005001); in part by the Key Project of Hebei Province Department of Education (ZD2020146); and in part by the Natural Science Foundation of China (61703133 and 61673158).
Conflicts of interest
The authors declared that there is no conflict of interest in whole or part of the paper.
Code or data availability
Not applicable.
Authors’ Contributions
Jun Li contributed to the design of the experiment and collected related data; Lixin Wei contributed to verify the experiment and put forward some corrections; Yintang Wen contributed to collect related references and check the results of the experiment; Xiaoguang Liu contributed to analyze most of the data, and wrote the initial draft of the paper; Hongrui Wang contributed to refine the ideas, carry out additional analysis and write this paper.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
