Abstract
To address the difficulty of hand gesture recognition in real scenes, a hand gesture recognition method that combines skin color with SVM is proposed. The skin color area is separated by the Otsu adaptive threshold algorithm in the YCbCr color space, and the hand gesture is segmented by a hand gesture area criterion. Hu moment features and the number of fingers are extracted from the hand gesture contour as the feature vector. Six common static gestures are classified and recognized by an SVM classifier. Experimental results show that the method has good stability and real-time performance, with an average recognition rate reaching 94%. As an application, the recognition results are converted into instructions that control a simulated NAO robot in real time in the Webots simulation environment, which verifies the feasibility of the hand gesture recognition algorithm.
Introduction
With the rapid development of information technology, human-computer interaction occupies an increasingly important position in people's lives. Hand gesture recognition, as a natural and human-friendly way of human-computer interaction, is increasingly adopted [1], and many scholars have worked on it. For example, Yang et al. [2] proposed a gesture recognition algorithm based on the main direction of the gesture and Hausdorff-like distance template matching, which solved the problem that recognition is affected by rotation, translation, and scaling of the hand gesture. However, the experiment could only be carried out under stable lighting, with little noise and no face interference. Dardas et al. [3] extracted scale-invariant image features and vectorized features, and then used bag-of-features and support vector machine techniques to recognize hand gestures, with good recognition results. However, due to the high computational complexity of the SIFT algorithm, recognition is slow and real-time performance is poor. Tao et al. [4] trained on image patches with an unsupervised sparse autoencoder neural network and extracted the edge features of the hand gesture image as the input for training the classifier. Finally, they adjusted the parameters of the trained classifier to improve accuracy. However, the method works only against a restricted background and cannot recognize gestures against realistic backgrounds.
In previous research, the hand gesture is usually recognized against a restricted or simple background in order to improve the recognition rate [5, 6]. Such methods cannot exclude interference such as faces, illumination changes, and skin-like colors, so they are not conducive to natural human-computer interaction. At present, there is no mature hand gesture recognition system that can be widely used in real environments. Developing such a system is therefore of great theoretical and practical value for improving human-computer interaction.
Hand gesture recognition design process
In a real environment, hand gesture recognition is difficult because of complex backgrounds, lighting changes, and differences in hand shape. To solve these problems, this paper proposes a hand gesture recognition method based on skin color and SVM (Support Vector Machine) and applies it to the control of the NAO robot. The design process is shown in Fig. 1.
Hand gesture recognition and its application flow chart.
First, the hand gesture image is captured in a real environment by the camera. After preprocessing, the hand gesture is detected and segmented. The hand gesture features are then extracted, and the gesture is recognized by the classifier. Finally, the recognition results are converted into instructions for real-time control of the simulated robot.
Color space conversion
In hand gesture detection, there are three commonly used skin color spaces: RGB, HSV and YCbCr. Experiments show that the chrominance and luminance of YCbCr color space are separated from each other, and the
According to the calculation statistics, the
Among them,
After image normalization, the image is transformed into a grayscale image of skin color similarity. The Otsu dynamic adaptive threshold algorithm is then used to segment the skin area. The Otsu method measures the difference between target and background by the variance between them; the gray level at which this variance reaches its maximum is used as the image segmentation threshold. When the between-class variance (
Although the Otsu threshold segmentation algorithm can accurately segment the skin color region, the face and skin-like objects still remain in the image, so the non-hand areas must be removed. For the skin color areas after binarization, the hand gesture area is determined by the conditions shown in Fig. 2.
Hand gesture area determination flowchart.
In this paper, the hand shape is analyzed and recognized as a closed contour, and only the case of wearing long-sleeved clothes is considered. Therefore, the only skin areas of the human body in the image are the face and the hand. After image binarization, repeated experiments have shown that a skin color area (or skin-like area) that accounts for less than 0.02 of the whole image is neither a hand nor a face, so such areas are eliminated. In Fig. 2, contArea denotes the area of a skin color region and imgArea denotes the area of the whole image. The remaining skin color areas are only the face and the hand. The ratio K [8] of the height to the width of each skin color area is then calculated; if K is within the range [0.7, 3.0], the area is taken as the gesture area. To recognize a gesture, the full gesture shape must appear in the capture window. If a skin color area touches the border of the capture window, it is not processed, whether it is a hand gesture or not, because an incomplete shape would lead to a misjudged recognition result.
As shown in Fig. 3, the original image (a) is binarized into image (b); the non-gesture areas are then removed to obtain the gesture image (c).
Image binary processing.
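The binarization that produces image (b) relies on the Otsu threshold. The following is a minimal sketch of that threshold search on an 8-bit grayscale buffer, assuming the skin color similarity image has already been computed; the function name `otsu_threshold` and the flat pixel array are illustrative assumptions, and the paper itself uses OpenCV for image processing.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the gray level t that maximizes the between-class variance.
   Pixels with value <= t form one class, pixels > t the other. */
int otsu_threshold(const unsigned char *gray, size_t n)
{
    assert(gray != NULL && n > 0);

    size_t hist[256] = {0};
    for (size_t i = 0; i < n; i++)
        hist[gray[i]]++;

    double sum_all = 0.0;
    for (int v = 0; v < 256; v++)
        sum_all += (double)v * (double)hist[v];

    size_t w_bg = 0;            /* pixel count of the class <= t */
    double sum_bg = 0.0;        /* gray-level sum of that class  */
    double best_var = -1.0;
    int best_t = 0;

    for (int t = 0; t < 256; t++) {
        w_bg += hist[t];
        sum_bg += (double)t * (double)hist[t];
        if (w_bg == 0)
            continue;           /* no pixels in the lower class yet */
        size_t w_fg = n - w_bg;
        if (w_fg == 0)
            break;              /* upper class is empty from here on */

        double diff = sum_bg / (double)w_bg
                    - (sum_all - sum_bg) / (double)w_fg;
        /* Proportional to the between-class variance (not divided by n^2). */
        double var = (double)w_bg * (double)w_fg * diff * diff;
        if (var > best_var) {
            best_var = var;
            best_t = t;
        }
    }
    return best_t;
}
```

A pixel is then marked as skin when its similarity value exceeds the returned threshold.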
Hand gesture contour extraction
After obtaining the complete binary image, we need to find the appropriate and accurate features to describe the hand gesture, and at the same time, we should try to minimize the amount of computation. Therefore, this paper chooses the contour of the gesture area to be processed, that is, to extract the shape of the object.
In this paper, OpenCV function library is used in image processing. The gesture contour is retrieved from the binary image using the library function cvFindContours. When mode
Feature extraction
In feature extraction, 7 Hu contour moments and fingertip numbers are extracted from hand gesture contour.
Hu contour moment
Hu invariant moments are invariant to image translation, rotation, and scaling. Unlike the commonly used silhouette moments [9], this paper calculates only the boundary moments of the gesture, which reduces both computation time and storage space.
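As an illustration of boundary moments, the sketch below computes the seven Hu invariants directly from a list of contour points. It makes two assumptions that are not stated in the paper: each contour point is treated as a unit point mass, and the standard normalization eta_pq = mu_pq / mu_00^(1 + (p+q)/2) is used. The function name `hu_contour_moments` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the seven Hu invariants of n contour points (x[i], y[i]). */
void hu_contour_moments(const double *x, const double *y, size_t n,
                        double hu[7])
{
    assert(x != NULL && y != NULL && n > 0);

    /* Zeroth moment and centroid of the point set. */
    double m00 = (double)n, cx = 0.0, cy = 0.0;
    for (size_t i = 0; i < n; i++) { cx += x[i]; cy += y[i]; }
    cx /= m00;
    cy /= m00;

    /* Central moments mu_pq = sum (x - cx)^p (y - cy)^q. */
    double mu20 = 0, mu02 = 0, mu11 = 0;
    double mu30 = 0, mu03 = 0, mu21 = 0, mu12 = 0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - cx, dy = y[i] - cy;
        mu20 += dx * dx;       mu02 += dy * dy;       mu11 += dx * dy;
        mu30 += dx * dx * dx;  mu03 += dy * dy * dy;
        mu21 += dx * dx * dy;  mu12 += dx * dy * dy;
    }

    /* Second-order terms are divided by m00^2. Third-order terms only
       occur in pairs, so each pair is divided by m00^5
       (= m00^2.5 * m00^2.5), which avoids a fractional power. */
    double s2 = 1.0 / (m00 * m00);
    double s5 = s2 * s2 / m00;
    double n20 = mu20 * s2, n02 = mu02 * s2, n11 = mu11 * s2;
    double a = mu30 - 3 * mu12, b = 3 * mu21 - mu03;
    double p = mu30 + mu12,     q = mu21 + mu03;

    hu[0] = n20 + n02;
    hu[1] = (n20 - n02) * (n20 - n02) + 4 * n11 * n11;
    hu[2] = (a * a + b * b) * s5;
    hu[3] = (p * p + q * q) * s5;
    hu[4] = (a * p * (p * p - 3 * q * q)
           + b * q * (3 * p * p - q * q)) * s5 * s5;
    hu[5] = ((n20 - n02) * (p * p - q * q) + 4 * n11 * p * q) * s5;
    hu[6] = (b * p * (p * p - 3 * q * q)
           - a * q * (3 * p * p - q * q)) * s5 * s5;
}
```

Because only the contour points are visited, the cost grows with the contour length rather than with the area of the hand region, which is the saving the paragraph refers to.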
Fingertip detection
Multiple fingertip detection experiments show that the number of fingertips can be obtained accurately by combining fingertip curvature with the convex hull of the gesture. Fingertips lie at positions where the contour curvature changes sharply [10]. The convex hull of the contour is the smallest convex polygon enclosing all the points of the gesture contour [11], and all of its vertices are points on the contour.
Firstly, all the fingertip candidate points are determined by curvature algorithm, and convex hull algorithm is used to find the convex hull of the hand contour. Comparing the convex hull vertices with the candidate points, we get the finger points and count the fingertip numbers. The fingertip detection effect of gesture in video is shown in Fig. 4.
Fingertip detection effect diagram.
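The combination of curvature candidates and convex hull vertices can be sketched as follows. This is an illustrative reading of the description, not the authors' code: sharpness is measured by the angle between the vectors to the contour points `step` positions before and after each point, the hull is built with Andrew's monotone chain, and the `Pt` type, the function names, and the parameters `step` and `min_cos` are all assumptions. The sketch also assumes a simplified contour in which each fingertip is a single vertex; a dense contour would additionally need neighboring candidates to be merged.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y; int idx; } Pt;   /* idx is filled internally */

static double cross(Pt o, Pt a, Pt b)
{
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

static int cmp_pt(const void *pa, const void *pb)
{
    const Pt *a = pa, *b = pb;
    if (a->x != b->x) return a->x < b->x ? -1 : 1;
    if (a->y != b->y) return a->y < b->y ? -1 : 1;
    return 0;
}

/* Sets on_hull[i] = 1 for each contour point that is a convex hull vertex
   (Andrew's monotone chain). */
static void mark_hull_vertices(const Pt *contour, int n, int *on_hull)
{
    Pt *p = malloc((size_t)n * sizeof *p);
    Pt *h = malloc(2 * (size_t)n * sizeof *h);
    assert(p != NULL && h != NULL);
    int k = 0;
    for (int i = 0; i < n; i++) {
        p[i] = contour[i];
        p[i].idx = i;
        on_hull[i] = 0;
    }
    qsort(p, (size_t)n, sizeof *p, cmp_pt);
    for (int i = 0; i < n; i++) {                      /* lower hull */
        while (k >= 2 && cross(h[k - 2], h[k - 1], p[i]) <= 0) k--;
        h[k++] = p[i];
    }
    for (int i = n - 2, t = k + 1; i >= 0; i--) {      /* upper hull */
        while (k >= t && cross(h[k - 2], h[k - 1], p[i]) <= 0) k--;
        h[k++] = p[i];
    }
    for (int i = 0; i < k; i++) on_hull[h[i].idx] = 1;
    free(p);
    free(h);
}

/* Counts contour points that are both sharp corners and hull vertices.
   A corner is sharp when the cosine of its angle exceeds min_cos. */
int count_fingertips(const Pt *contour, int n, int step, double min_cos)
{
    assert(n >= 3 && step >= 1 && step < n && min_cos >= 0.0);
    int *on_hull = malloc((size_t)n * sizeof *on_hull);
    assert(on_hull != NULL);
    mark_hull_vertices(contour, n, on_hull);

    int count = 0;
    for (int i = 0; i < n; i++) {
        Pt c = contour[i];
        Pt a = contour[(i - step + n) % n], b = contour[(i + step) % n];
        double ux = a.x - c.x, uy = a.y - c.y;
        double vx = b.x - c.x, vy = b.y - c.y;
        double dot = ux * vx + uy * vy;
        double len2 = (ux * ux + uy * uy) * (vx * vx + vy * vy);
        /* Compare squared cosines to avoid sqrt and acos. */
        int sharp = dot > 0.0 && dot * dot > min_cos * min_cos * len2;
        if (sharp && on_hull[i]) count++;
    }
    free(on_hull);
    return count;
}
```

The hull test is what rejects the valleys between fingers: they are as sharp as the fingertips but lie inside the convex hull.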
According to the above analysis, the feature vectors for hand gesture recognition are composed of seven Hu moment features of Hu1
After the gesture is segmented, the contour is extracted from the gesture area. The seven Hu moments of the gesture contour are then calculated with the Hu invariant moment formulas, and the number of fingers is detected with the contour curvature and convex hull algorithm. The seven Hu invariant moments and the number of fingertips are combined to form the feature vector.
The recognition object uses 6 commonly used static gestures, as shown in Fig. 5.
Feature vector of gesture 5
Six kinds of gestures.
In gesture recognition, the sample of hand gestures is limited, and a classifier for accurate classification of finite samples is needed [12]. Therefore, the support vector machine (SVM) proposed by Vapnik is used as a classifier for training samples and recognizing samples.
Database establishment
The database is built in 3 environments with 5 experimenters and 6 common gestures, each performed 10 times, giving 3 × 5 × 6 × 10 = 900 samples; half are used as the training sample set and the other half as the test sample set. For each gesture, 150 images are collected in various situations such as blurring, different backgrounds, angle rotation, and size scaling. Due to space limitations, only the contour images and feature vector values of gesture 5 under ten conditions are given, as shown in Fig. 6 and Table 1.
Ten cases of gesture 5 and its contour image.
After establishing 6 gesture databases, SVM classifier is used to train and classify hand gesture samples respectively. In the classification process, the specific steps to use the LIBSVM software package are as follows:
First, the sample set data are converted to the required format. The feature data of the 450 training samples are stored in train_hand.txt, and the feature data of the remaining 450 test samples are stored in test_hand.txt. The svmscale function is used to scale the data so that the feature values share a uniform range. Then the samples are trained and classified. The svmtrain function is used on the training data to find the optimal parameters and train the classifier; the resulting training model is hand.model. The test sample set (test_hand.txt) is then classified and recognized with the svmpredict function and the training model.
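The format conversion and scaling steps can be illustrated with two small helpers. LIBSVM reads one sample per line as `label index:value ...` with indices starting at 1, and its scaling tool maps each feature linearly to a target range, by default [-1, 1]. The function names `scale_feature` and `format_libsvm_line` are hypothetical, and the sketch only mirrors what the LIBSVM tools do; it does not replace them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Linearly maps v from [min, max] to [lo, hi] (per-feature scaling). */
double scale_feature(double v, double min, double max, double lo, double hi)
{
    assert(max > min && hi > lo);
    return lo + (hi - lo) * (v - min) / (max - min);
}

/* Writes one sample as "label 1:f1 2:f2 ...\n" into out.
   Returns the number of characters written, or -1 if out is too small. */
int format_libsvm_line(int label, const double *feat, int n,
                       char *out, size_t cap)
{
    assert(feat != NULL && out != NULL && cap > 0);
    int len = snprintf(out, cap, "%d", label);
    for (int i = 0; i < n && len > 0 && (size_t)len < cap; i++)
        len += snprintf(out + len, cap - (size_t)len,
                        " %d:%g", i + 1, feat[i]);
    if (len > 0 && (size_t)len < cap)
        len += snprintf(out + len, cap - (size_t)len, "\n");
    return (len > 0 && (size_t)len < cap) ? len : -1;
}
```

In this paper's setting each line would carry eight features: the seven Hu moments followed by the fingertip count, with the gesture number as the label.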
Support vector machine (SVM) and template matching method were used to recognize gesture, and 450 images in the experimental database were tested. By comparing the experimental results obtained by the two methods, the average recognition rate and average recognition time of 6 commonly used static gestures are obtained.
6 kinds of hand gesture recognition rate
Experimental results of two methods
Table 2 shows that the recognition rate of the SVM-based method is significantly higher than that of the template matching method and meets the recognition rate requirement. The recognition rate of gesture 4 is slightly lower. Experimental analysis shows that most of the misjudged samples were recognized as gesture 3 or gesture 5. This is because, in some cases, the Hu moment features of gesture 4 are similar to those of gesture 3 and gesture 5, and the small difference between them leads to misjudgment. In addition, how well the gesture samples conform to the standard gestures and the complexity of the gesture background also contribute to misjudgment.
From Table 3, we can see that the hand gesture recognition algorithm based on SVM can ensure that the experimental results have a high recognition rate. Although the running efficiency is lower than that of template matching, it can also meet the real-time requirement and ensure the robustness and stability of gesture recognition system.
In the Webots robot simulation environment, hand gesture images are captured through the camera, and hand gesture recognition is performed using the algorithm mentioned above. Then the recognition results are transformed into control instructions and transmitted to the robot NAO, calling the relevant API functions of NAO and making corresponding actions. The gestures are used to control the robot in real time, enabling the robot to move forward, turn back, turn left, turn right, sit down and stand up.
To make the algorithm portable, quick to develop, and efficient in the Webots environment, it is developed on the third-party computer vision library OpenCV. OpenCV is ported to the Webots environment, and the Webots compiler environment is configured accordingly.
The simulated “world” is equivalent to the real world. A “Move forward” gesture instruction is given a fixed number of steps (at most 16 in this paper), after which the robot stops automatically and executes the next instruction. After a “Sit down” instruction, the next instruction must be “Stand up”, and the system automatically blocks all other gesture commands. The time interval between gestures is set to 3 s, and gesture images captured during this interval are not processed. This ensures that, while instructions are being recognized and executed, the system shields mistaken gestures made by the experimenter so that they do not interfere with the normal operation of the robot. The instruction conversion is shown in Table 4.
Translation table of hand gesture instruction
The simulation design of NAO robot control based on gesture recognition is as follows:
To control NAO in Webots, first open the controller, then edit the control program in the textbox, and initialize before designing the main program. The corresponding functions are as follows:
wb_robot_init();
Next, enable the body devices and load the “action” (motion) files. The function that enables the body devices:
find_and_enable_devices();
Load the “action” file function:
load_motion_files();
When gesture code 1 is received, the forward loop function is called, and the robot keeps walking until the next gesture instruction is received or the maximum number of steps (16) is completed, at which point it stops. Forward loop function:
wbu_motion_set_loop(forwards,true);
wbu_motion_play(forwards);
When NAO receives the next gesture instruction, that is, an instruction to change the current movement, it must stop the current action and execute the new one. Interrupt the current action:
wbu_motion_stop(currently_playing);
Execute new action: (motion represents the new action instruction)
wbu_motion_play(motion);
currently_playing = motion;
In the NAO turn design, the angle parameter of the left turn and the right turn is set to 90 degrees, and the angle parameter of the backward turn is set to 180 degrees. For example:
start_motion(turn_left_90);
start_motion(turn_right_90);
By calculating the response time of the robot after receiving instructions, the average response time of the system to six gesture instructions is about 65 ms. It has good real-time performance and can meet the requirement of the gesture control robot.
This paper focuses on the key technologies of hand gesture segmentation and recognition, covering hand gesture segmentation, feature extraction, gesture recognition, and gesture control of a robot. The gesture recognition experiments show that the algorithm is stable and achieves a high recognition rate, and the gesture-controlled robot simulation shows good real-time performance.
The experiment completes the real-time control simulation of the NAO robot with gestures captured by an external camera. In the next step, the algorithm will be ported to the real NAO robot, and the robot's “eyes” will replace the external camera, so that the robot can capture and recognize gesture images on its own, perform the corresponding action, and finally complete specific assigned tasks.
Acknowledgments
This work is supported by technology plan public project of Zhejiang Province in China (No. LGG18F040002), and supported by Natural Science Foundation of Zhejiang Province in China (No. LY19F020035). The authors thank the members of the Center for Advanced Life Cycle Engineering at the University of Maryland for their support of this work.
