Abstract
A new algorithm for static hand gesture recognition is proposed in this paper, which mainly includes the following four steps: hand segmentation, arm removal, feature extraction and gesture recognition. Firstly, the hand is extracted from the background by using skin-color features and geometric characteristics. Secondly, a new arm removal algorithm is proposed, which effectively and quickly removes the arm area by using distance transformation operations, yielding a gesture image composed of the palm and fingers. Finally, Hu moments of the gesture image and the number of fingertips are calculated and fed into a Support Vector Machine (SVM) for training. Experiments demonstrate that the proposed algorithm is robust against complex backgrounds and can detect and recognize gestures in real time with an accuracy of 94.89%.
Introduction
With the development of computer vision, pattern recognition and other related technologies, the research of human-computer interaction technology has been greatly promoted. Compared with the traditional interactive methods such as mouse and keyboard, the gesture-based interaction [1–3] is more natural and convenient, and has been widely used in many fields, including smart home [4], robot control [5, 6], sign language recognition [7, 8] and so on.
As the core technology of most gesture interaction systems, gesture recognition technology can be divided into two categories: methods based on data gloves and methods based on vision. The glove-based method uses multiple sensors to acquire the position, direction, speed and other information of the gesture, and applies a specific recognition algorithm to achieve gesture recognition [9, 10]. It requires the person being identified to wear data gloves. The vision-based method processes the captured gesture image or video with image processing and pattern recognition techniques. It has attracted considerable research interest because of its low equipment requirements and natural interaction [11–15].
In general, vision-based gesture recognition can be divided into two categories: (1) Static gesture recognition [16–18]. Static gesture focuses on the specific shape of the hand at a certain time, and the task of the recognition system is to determine the meaning of the hand shape. (2) Dynamic gesture recognition [19–22]. Dynamic gesture refers to the movement of the hand during a period of time, and the task of the recognition system is to understand the meaning of the whole process of movement. These two kinds of gesture recognition have different applications in practice, and the research methods are also very different. This paper focuses on the study of vision-based static gesture recognition.
Figure 1 shows the flow of a typical static gesture recognition system. The gesture is firstly detected and segmented from the background. Then, some features are used to represent the gesture, and the extracted gesture features are recognized using a classifier. How to recognize the gesture in the input image quickly and accurately is the key problem that the gesture recognition system needs to solve.

The flow of a typical static gesture recognition system.
The combination of palm and fingers can fully describe the information of a static gesture. In human-computer interaction, people may do the same gesture with their arms exposed or not exposed. The arm area existing in Fig. 2 is redundant information, which will interfere with the recognition of gesture. Original images obtained in different situations will result in different hand segmentation results. Since the removal of the arm area contributes to the recognition of the gesture, it is necessary to judge whether the arm area exists in the segmented image and further remove the existing arm.

The composition of hand (Bare arm situation).
Many methods for static hand gesture recognition only consider the situation in which no arm is exposed, and the gesture images used for training and testing do not contain arms. Because they ignore arm interference, these methods are difficult to apply widely. Some measures can be taken to avoid arm interference, such as wearing particular colored gloves [23] and controlling the camera’s perspective, but they reduce the naturalness and practicality of gesture interaction.
Some algorithms for arm-area removal have been studied. The arm removal algorithms proposed in [15, 24] rely on the fact that the wrist is narrower than both the palm and the arm. In [15], the position of a horizontal wrist line is determined by searching for the narrowest boundary width, and the area below the line is removed. This algorithm is limited because it requires the palm to lie in the upper part of the image and the arm in the lower part, so it is not suitable for situations such as a tilted hand. In [24], the wrist line is determined by searching along the main axis of the minimum enclosing rectangle of the hand. Although this algorithm is not affected by the rotation of the hand, it cannot remove the arm quickly because of its high computational complexity. An arm removal algorithm based on the morphological opening operation is proposed in [12]. Owing to the large-diameter structural element, the arm area cannot be completely removed, which affects gesture feature extraction and the classification results. In [11], a cutting line determined by distance transformation and vector dot product operations is used for arm removal. Although this algorithm can effectively remove the arm, its time complexity is too high to meet the requirements of a real-time system.
Combined with depth information (the distance information between the object and the camera), the method proposed in [25] can find the circle or ellipse that fits the palm region, and extract the palm as well as fingers successfully. Some gesture recognition algorithms based on depth information have been developed [18, 26–28], in which 3D sensors (such as Kinect) are required for depth information acquisition. However, there are still some limitations since 3D sensors are not always applicable to many systems, such as Google Glass. Besides, the computational complexity increased by depth information will lead to lower recognition efficiency.
To solve the above problems, a new algorithm for arm-area removal based on a web camera is developed in this paper. The distance transformation operation is applied twice to the binary image obtained after hand segmentation. Combined with the geometric characteristics of the hand, it achieves accurate positioning of the palm and complete removal of the arm. Moreover, the algorithm has low computational complexity and is not affected by the rotation of the hand. Experiments show that the static gesture recognition system based on this arm removal algorithm has a high recognition rate and good robustness against complex environments.
The flow of the static gesture recognition algorithm in this paper is shown in Fig. 3, including the following main steps: first, hand segmentation is applied to the processing of the original image, thus a binary image containing only the hand is obtained; second, the proposed arm removal algorithm is used to obtain a gesture image composed of the palm and fingers; third, feature vector, composed of Hu moments and the number of fingertips, is extracted to describe the gesture; finally, SVM is used to classify the input feature vector, and achieve gesture recognition. How to remove the arm interference accurately and quickly is the focus of this paper.

The proposed static gesture recognition algorithm.
Illumination variations and noise affect skin color segmentation of the original image. Before the skin color model is established, a Gaussian filter and illumination compensation are applied to the image to reduce the noise and the differences in grayscale distribution. A nonlinear transformation is used as the illumination compensation method to adjust the brightness of the original image, bringing the overall brightness of the image closer to an intermediate level and thus overcoming the effect of illumination variations to some extent. YCr’Cb’ space, a color space in which chrominance and luminance are separated, has the advantages of strong skin color clustering and little sensitivity to external illumination [29]. After the image is converted from RGB color space to YCr’Cb’ space, threshold operations are applied to the chrominance components to realize skin color segmentation, thereby avoiding the influence of luminance. As shown in Fig. 4(b), the skin color regions and the skin-like regions in the image can be detected and segmented. Due to the influence of background factors, the segmented binary image often contains some noise. Morphological processing is used to eliminate isolated noise regions.

Hand segmentation.
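The chrominance thresholding step can be illustrated with a minimal sketch. It is not the implementation used in this paper: it uses the standard YCrCb conversion (ITU-R BT.601 weights) rather than the YCr’Cb’ variant of [29], and the threshold ranges `CR_RANGE` and `CB_RANGE` are assumed values, since the thresholds are not stated in the text. The function names are illustrative only.

```python
# Sketch only: standard YCrCb and assumed thresholds, not the paper's YCr'Cb' model.

def rgb_to_ycrcb(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cr, Cb) using ITU-R BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    return y, cr, cb

# Assumed chrominance bounds; the text does not list the thresholds it uses.
CR_RANGE = (133.0, 173.0)
CB_RANGE = (77.0, 127.0)

def is_skin(r, g, b):
    """Classify a pixel from its chrominance only; the luminance Y is ignored."""
    _, cr, cb = rgb_to_ycrcb(r, g, b)
    return CR_RANGE[0] <= cr <= CR_RANGE[1] and CB_RANGE[0] <= cb <= CB_RANGE[1]

def skin_mask(image):
    """Map an image (rows of (R, G, B) tuples) to a binary mask of 0/1 values."""
    return [[1 if is_skin(*pixel) else 0 for pixel in row] for row in image]

# A skin-like pixel, then pure blue, white and pure green.
demo = [[(200, 150, 120), (0, 0, 255)],
        [(255, 255, 255), (0, 255, 0)]]
demo_mask = skin_mask(demo)
```

Because the decision uses only Cr and Cb, a darker version of the same skin tone, such as (100, 75, 60), is still classified as skin, which is the motivation for separating chrominance from luminance.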
It is difficult to predict the changeable background in human-computer interaction. Therefore, it is assumed that the algorithm is applied to the complex background, where all skin-like objects are smaller than face and hand. The two connected regions (face and hand) with the largest area are extracted, and the small holes inside the areas are filled to obtain the ideal binary image composed of face and hand. The shape complexity C is defined as Equation (1).
In Equation (1), A is the area of the region and L is its perimeter; C takes its maximum value for a circular region. In this paper, the C value of the face is greater than that of the hand. If the C value of a remaining connected region is greater than 0.3, the region is considered to be a face and is removed. The result is shown in Fig. 4(c).
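Equation (1) itself does not appear in this text. One form consistent with the description (a function of A and L that is maximal for a circle and compatible with a threshold of 0.3) is the isoperimetric quotient C = 4πA / L². The sketch below assumes that form; it is an assumption, not a reconstruction confirmed by the source, and the function names are illustrative.

```python
import math

def shape_complexity(area, perimeter):
    """Assumed form C = 4*pi*A / L**2: equals 1 for a circle, smaller otherwise."""
    return 4.0 * math.pi * area / perimeter ** 2

def is_face(area, perimeter, threshold=0.3):
    """Apply the 0.3 threshold quoted in the text to the assumed definition of C."""
    return shape_complexity(area, perimeter) > threshold

# A circle of radius 10 reaches the maximum value of C.
radius = 10.0
circle_c = shape_complexity(math.pi * radius ** 2, 2.0 * math.pi * radius)

# An elongated 10 x 1 rectangle has a much smaller C.
elongated_c = shape_complexity(10.0 * 1.0, 2.0 * (10.0 + 1.0))
```

Under this assumed definition, a compact, roughly elliptical face region scores well above 0.3, while a hand with spread fingers has a long perimeter relative to its area and scores lower.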
Considering arm exposed and arm not exposed, the arm removal algorithm is divided into two sub-algorithms: discrimination algorithm for the existence of arm area and removal algorithm for existing arm area.
Discrimination algorithm for the existence of arm area
For the hand segmentation image, combined with geometric characteristics of hand, distance transformation operation is used twice to accurately discriminate the existence of arm area.

The elements of binary image.
Using distance transformation operations, the method of detecting the existence of arm area is as follows:
A distance transformation map (shown in Fig. 6) is obtained by applying the distance transformation operation to the hand segmentation image. Due to the geometry of the hand, the palm center is always the brightest pixel in the distance transformation map (with or without the arm area). The method traverses the distance transformation map, taking the largest pixel value as the radius R0 of the inscribed circle of the palm and the corresponding pixel Pc as the palm center, thus achieving accurate positioning of the palm center.
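As an illustration of this step, the sketch below computes a brute-force Euclidean distance transformation on a small synthetic binary mask and takes the maximum as (Pc, R0). It is a toy under stated assumptions: the mask `toy_hand` is invented for the example, the mask is assumed to contain at least one background pixel, and a practical system would use an optimized library routine rather than this quadratic-time loop.

```python
import math

def distance_transform(mask):
    """Distance from each foreground pixel to the nearest background (0) pixel."""
    h, w = len(mask), len(mask[0])
    background = [(y, x) for y in range(h) for x in range(w) if mask[y][x] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dist[y][x] = min(math.hypot(y - by, x - bx) for by, bx in background)
    return dist

def palm_center(mask):
    """Return (Pc, R0): the pixel with the largest distance value and that value."""
    dist = distance_transform(mask)
    r0, pc = max((dist[y][x], (y, x))
                 for y in range(len(mask)) for x in range(len(mask[0])))
    return pc, r0

def toy_hand():
    """Synthetic 13 x 11 mask: one thin finger, a 5 x 5 palm and a 3-pixel-wide arm."""
    mask = [[0] * 11 for _ in range(13)]
    for y in range(0, 4):
        mask[y][5] = 1                      # finger
    for y in range(4, 9):
        for x in range(3, 8):
            mask[y][x] = 1                  # palm
    for y in range(9, 13):
        for x in range(4, 7):
            mask[y][x] = 1                  # arm
    return mask

pc, r0 = palm_center(toy_hand())
```

On this toy mask the maximum falls at the middle of the palm block even though the arm is present, because the arm is narrower than the palm.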
In order to reduce the impact of the palm on subsequent processing, every pixel whose distance from Pc is less than R1 is set to 0. In this paper, the radius R1 of the circle that cuts out the palm is taken as R1 = 1.35 × R0. Considering both situations (arm exposed and arm not exposed), and according to the composition of the remaining foreground regions, the binary image obtained after the above processing falls into one of the following four cases:
– case 1: arm and fingers (when arm exposed)
– case 2: arm, fingers and part of the palm (when arm exposed)
– case 3: fingers (when arm not exposed)
– case 4: fingers and part of the palm (when arm not exposed)
Distance transformation map.
Another distance transformation operation is applied to the image obtained above, giving the shortest distance from each pixel in the remaining region to the boundary. The maximum value in this distance transformation map is denoted Pmax, and the corresponding pixel is denoted P. Fig. 7 shows the processing of the four cases.

The processing of four cases.
For case 1 and case 2, P should be located in the arm area, generally on the midline of the arm, and Pmax is large. For case 3, P should be located in the finger area, generally on the midline of a finger, and Pmax is relatively small. For case 4, P is located in a finger or in the remaining part of the palm, and Pmax is small.
From the above analysis, the value of Pmax is related to the existence of the arm area. When the arm is exposed (case 1 and case 2), the calculated Pmax is large; when the arm is not exposed (case 3 and case 4), the calculated Pmax is small.
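The two-pass procedure can be sketched on a synthetic example as follows. The code locates the palm with a first distance transformation, zeroes the circle of radius R1 = 1.35 × R0, runs a second distance transformation and reports Pmax. The toy mask, the brute-force transform and all function names are assumptions made for illustration; no decision threshold on Pmax is applied here.

```python
import math

def distance_transform(mask):
    """Distance from each foreground pixel to the nearest background (0) pixel."""
    h, w = len(mask), len(mask[0])
    background = [(y, x) for y in range(h) for x in range(w) if mask[y][x] == 0]
    return [[min(math.hypot(y - by, x - bx) for by, bx in background)
             if mask[y][x] else 0.0 for x in range(w)] for y in range(h)]

def argmax2d(grid):
    """Return (largest value, (row, col)) of a 2-D list."""
    return max((v, (y, x)) for y, row in enumerate(grid) for x, v in enumerate(row))

def p_max_after_palm_cut(mask, scale=1.35):
    """First pass finds Pc and R0; the palm circle is cut; second pass gives Pmax, P."""
    r0, (cy, cx) = argmax2d(distance_transform(mask))
    r1 = scale * r0
    cut = [[0 if math.hypot(y - cy, x - cx) < r1 else v
            for x, v in enumerate(row)] for y, row in enumerate(mask)]
    return argmax2d(distance_transform(cut))

def toy_hand(with_arm):
    """Synthetic 13 x 11 mask: thin finger, 5 x 5 palm, optional 3-pixel-wide arm."""
    mask = [[0] * 11 for _ in range(13)]
    for y in range(0, 4):
        mask[y][5] = 1                      # finger
    for y in range(4, 9):
        for x in range(3, 8):
            mask[y][x] = 1                  # palm
    if with_arm:
        for y in range(9, 13):
            for x in range(4, 7):
                mask[y][x] = 1              # arm
    return mask

p_max_arm, p_arm = p_max_after_palm_cut(toy_hand(with_arm=True))
p_max_no_arm, p_no_arm = p_max_after_palm_cut(toy_hand(with_arm=False))
```

In this toy case the remaining arm stub is wider than the remaining fingertip, so Pmax is larger when the arm is present and P lies on the arm midline, which matches the qualitative behavior described for cases 1 to 4.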
If
Eight-connected discrimination algorithm and XOR operation are used for arm removal in this paper. The process of removing the arm area is shown in Fig. 8.

The process of removing the arm area (⊕ represents XOR operation).
For the image obtained in
Hu moments, composed of 7 invariant moments, keep their values constant under translation, rotation and scale change of the gesture, which makes them suitable for describing the overall shape of the gesture. In addition, Hu moments are less affected by noise and can be applied to static gesture recognition in complex backgrounds. It has been shown that combining multiple features can improve the accuracy and robustness of the algorithm [30]. Therefore, the extracted Hu moments of the gesture image and the number of fingertips are combined into an 8-dimensional feature vector, which is used as a training sample or testing sample for the SVM.
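For reference, the seven Hu invariants can be computed from normalized central moments as in the sketch below, which treats the binary gesture image as a set of unit-mass pixels. The helper `feature_vector` and its argument names are illustrative, and the ordering of the 8 components is an assumption. On a pixel grid, translation and 90° rotation invariance hold up to floating-point error, while scale invariance is only approximate.

```python
def hu_moments(mask):
    """Seven Hu invariants of a binary mask given as a list of rows of 0/1."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = float(len(pts))
    xc = sum(x for x, _ in pts) / m00
    yc = sum(y for _, y in pts) / m00

    def eta(p, q):
        """Normalized central moment eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2)."""
        mu = sum((x - xc) ** p * (y - yc) ** q for x, y in pts)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]

def feature_vector(mask, finger_count):
    """8-dimensional vector: seven Hu moments followed by the fingertip count."""
    return hu_moments(mask) + [float(finger_count)]

# Small asymmetric shape, a translated copy and a copy rotated by 90 degrees.
shape = [[1, 1, 1, 0],
         [1, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 0, 0, 0],
         [1, 0, 0, 0]]
shifted = [[0] * 7] + [[0, 0] + row + [0] for row in shape]
rotated = [list(r) for r in zip(*shape[::-1])]
```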
To calculate the number of fingertips, a circle is drawn on the gesture image, and the pixels within the circle are all set to 0. In order to keep the palm inside the circle, Pc is chosen as the center and R2 = 1.95 × R0 as the radius. As shown in Fig. 9, only finger regions remain in the binary image, and their number can be counted using the eight-connected discrimination algorithm.

The number of fingertips.
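The counting step can be sketched as follows: zero every pixel within R2 = 1.95 × R0 of Pc, then count the 8-connected foreground regions that remain. The synthetic mask `toy_gesture`, the breadth-first labeling and the function names are assumptions made for illustration; Pc and R0 are passed in as known values.

```python
import math
from collections import deque

def count_fingers(mask, pc, r0, scale=1.95):
    """Zero the disc of radius scale * r0 around pc, then count 8-connected regions."""
    cy, cx = pc
    r2 = scale * r0
    h, w = len(mask), len(mask[0])
    rest = [[mask[y][x] if math.hypot(y - cy, x - cx) > r2 else 0
             for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = 0
    for y in range(h):
        for x in range(w):
            if rest[y][x] and not seen[y][x]:
                regions += 1                      # new region found: flood it
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    qy, qx = queue.popleft()
                    for dy in (-1, 0, 1):         # visit all eight neighbours
                        for dx in (-1, 0, 1):
                            ny, nx = qy + dy, qx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and rest[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return regions

def toy_gesture(finger_columns):
    """Synthetic 13 x 13 mask: 5 x 5 palm centred at (10, 6) plus vertical fingers."""
    mask = [[0] * 13 for _ in range(13)]
    for y in range(8, 13):
        for x in range(4, 9):
            mask[y][x] = 1
    for x in finger_columns:
        for y in range(0, 8):
            mask[y][x] = 1
    return mask

two_fingers = count_fingers(toy_gesture((4, 8)), pc=(10, 6), r0=3.0)
three_fingers = count_fingers(toy_gesture((4, 6, 8)), pc=(10, 6), r0=3.0)
fist = count_fingers(toy_gesture(()), pc=(10, 6), r0=3.0)
```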
SVM is a pattern recognition method based on structural risk minimization, which has many unique advantages in solving small-sample and nonlinear pattern recognition problems. The static hand gesture recognition problem in this paper is a small-sample, nonlinear classification problem, so SVM, with its excellent classification performance, is selected as the classifier. First, the input feature vectors are normalized. In the training phase, the RBF kernel function is selected, the parameters are optimized using a grid search, and an SVM model is obtained. In the testing phase, the feature vector is input into the SVM model to obtain the final recognition result.
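Normalization matters here because the Hu moments and the fingertip count differ by orders of magnitude, and the RBF kernel depends on squared distances between feature vectors. The sketch below shows one possible normalization (min-max scaling to [0, 1], which is an assumption, since the normalization scheme is not specified in the text) together with the RBF kernel K(u, v) = exp(-γ‖u - v‖²). The grid search over the SVM parameters is not shown, and the sample values are made up for illustration.

```python
import math

def min_max_normalize(samples):
    """Scale each feature column to [0, 1]; constant columns are mapped to 0."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in samples]

def rbf_kernel(u, v, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance between u and v)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Three made-up feature vectors (shortened to 3 components for readability):
# two moment-like values of very different magnitude and a fingertip count.
raw = [[0.2, 1e-4, 1.0],
       [0.4, 3e-4, 5.0],
       [0.3, 2e-4, 3.0]]
scaled = min_max_normalize(raw)
k_far = rbf_kernel(scaled[0], scaled[1], gamma=0.5)
k_same = rbf_kernel(scaled[0], scaled[0], gamma=0.5)
```

Without scaling, the fingertip count would dominate the distance in the kernel and the small-valued moments would contribute almost nothing.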
Experimental results
The five gestures in Fig. 10 are used to verify the effectiveness and feasibility of the proposed algorithm. All the experiments are performed in MATLAB 2013b on a computer with an Intel(R) Core(TM) i5-4440 (3.10 GHz) processor, 8 GB of memory, and the Windows 7 operating system. A monocular web camera with a resolution of 640 × 480 is used to collect images of the five hand gestures from three people in complex backgrounds. In this way, 450 testing images (90 images for each gesture) and 1500 training images (300 images for each gesture) are created, which constitute dataset 1 (shown in Fig. 11). Dataset 1 contains gestures with arms exposed and not exposed, and some examples (after hand segmentation) are shown in Fig. 12. On the same dataset, and combined with the proposed methods of hand segmentation, feature extraction and recognition, the arm removal algorithms in [11, 12] are used for comparison to demonstrate the advantages of the proposed algorithm in recognition rate and image processing speed.

Five gestures used for experiments.

Examples of the gesture dataset 1.

Examples of the gesture dataset 1 (after hand segmentation).
As shown in Fig. 13, five hand gesture images are processed by the proposed algorithm. Under different illumination, the hand segmentation algorithm can detect and extract the hands accurately. The arm removal algorithm based on distance transformation can remove the arm interference and obtain ideal gesture images. Besides, it is not affected by the rotation of the hands. On this basis, a gesture recognition system with a recognition rate of 94.89% is realized by extracting gesture features and training SVM classifier.

The processing of the proposed algorithm.
Table 1 shows the recognition results of the proposed algorithm performed on the testing set. From the table, we can conclude that the proposed algorithm was very accurate for gesture 1, gesture 2 and gesture 3. Slightly less accurate were gesture 4 and gesture 5. Out of 450 hand gesture images tested, the proposed algorithm was able to recognize 427 gestures correctly, giving an accuracy of 94.89%.
Recognition results of the proposed algorithm.
A histogram of the recognition results of different gesture recognition algorithms on the same dataset is shown in Fig. 14. From the histogram, we can see that without removing the arm area, the recognition rate of the proposed algorithm on the testing set is only 89.67%, which confirms the effect of redundant arm information on the recognition result.

Gesture recognition performance using different arm removal algorithms on dataset 1.
The arm removal algorithm in [12] is applied to the five images obtained after hand segmentation (shown in Fig. 13), and the results are shown in Fig. 15. For gesture 1, gesture 2 and gesture 4, there are many isolated noise regions in the processed images, which affect feature extraction and gesture recognition. For gesture 3 and gesture 5, since the diameter of the structural element is too large, the arm area is not removed from the processed images. Using this arm removal algorithm, a gesture recognition system is constructed with a recognition rate of 89.11% (shown in Fig. 14).

The arm removal algorithm in [12].
Compared with not removing the arm area, incorrectly removing the arm area increases the interference in gesture recognition and degrades the recognition result. From the histogram (shown in Fig. 14), it can be seen that the recognition rate of [12] is 0.56 percentage points lower than that of the proposed algorithm without arm removal, and 5.78 percentage points lower than that of the proposed algorithm. The arm area therefore needs not only to be removed, but to be removed accurately.
As shown in Fig. 16, the arm removal algorithm in [11] is used to process the same five images. The algorithm in [11] uses the distance transformation map and the palm cutting circle to determine the midline of the arm, traverses the edge of the hand to find the arm removal line that is perpendicular to the midline of the arm (the vector dot product is 0), and then removes the arm area. Compared with [12], [11] can accurately remove arm interference. The gesture recognition system constructed using the arm removal algorithm in [11] has a recognition rate of 91.33%, which is 2.22 percentage points higher than with the algorithm in [12] and 1.66 percentage points higher than the proposed algorithm without arm removal, but still 3.56 percentage points lower than the proposed algorithm (shown in Fig. 14).

The arm removal algorithm in [11].
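The recognition figures quoted in this section can be cross-checked with a few lines of arithmetic. The dictionary keys below are labels chosen for this sketch; the numbers are the ones reported in the text for dataset 1.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)

# 427 of the 450 testing images are recognized correctly.
overall = accuracy_percent(427, 450)

# Recognition rates (%) quoted in the text for the four configurations.
rates = {
    "proposed": 94.89,
    "no_arm_removal": 89.67,
    "arm_removal_12": 89.11,
    "arm_removal_11": 91.33,
}

def gap(first, second):
    """Difference between two recognition rates, in percentage points."""
    return round(rates[first] - rates[second], 2)

proposed_vs_11 = gap("proposed", "arm_removal_11")
proposed_vs_12 = gap("proposed", "arm_removal_12")
ref11_vs_12 = gap("arm_removal_11", "arm_removal_12")
ref11_vs_none = gap("arm_removal_11", "no_arm_removal")
none_vs_12 = gap("no_arm_removal", "arm_removal_12")
```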
Based on the same methods of hand segmentation, feature extraction and recognition, the calculation time of the gesture recognition system built in the experimental part largely depends on the complexity of the arm removal algorithm. In this experiment, 250 training images (50 images for each gesture) were randomly selected from dataset 1 to form dataset 2. Different arm removal algorithms were used to deal with dataset 2, and the calculation time is shown in Table 2.
The calculation time of different arm removal algorithms performed on dataset 2
The calculation time of the arm removal algorithm in [11] is closely related to the number of pixels in the foreground area of the input hand image. The more pixels in the foreground area, the longer the algorithm takes. Therefore, the processing time of the algorithm will be very different for the presence or absence of the arm area in the hand image and the distance between the hand and the camera.
As can be seen from Fig. 17, when the arm removal algorithm in [11] is used to process the five gestures of the dataset 2, the average processing time of the algorithm fluctuates greatly for different gestures. The average processing time for each image is 1.2045 s (shown in Table 2). Although the recognition rate of the algorithm is as high as 91.33%, the algorithm is still not suitable for a real-time system with high stability requirements.

Average processing time of different arm removal algorithms.
The five gestures of dataset 2 are also processed using the arm removal algorithm in this paper and the one in [12]; their average processing times correspond to the two flat curves in Fig. 17. The proposed arm removal algorithm takes about 0.1137 s to process each image, which is less than half of the average processing time of the arm removal algorithm in [12] (shown in Table 2).
Furthermore, the proposed algorithm can be extended to other gestures (shown in Fig. 18) and is not limited to the five gestures in Fig. 10. Different gestures can represent different numbers or can be mapped to different instructions, and the gesture recognition system realizes human-computer interaction by recognizing the meaning of the gestures.

Examples of gestures for the proposed algorithm.
The hand gesture recognition algorithm proposed in this paper shows good performance on both datasets, which include multiple users. The hand segmentation method, which combines skin color features with geometric characteristics, is shown to extract the hand accurately and to be robust against complex backgrounds and illumination variations. The proposed arm removal method, with low computational complexity, achieves complete removal of the arm area and is unaffected by hand rotation. A gesture recognition system with a recognition rate of 94.89% is implemented based on the arm removal method, which effectively addresses the problems of low recognition rate and slow processing speed in complex backgrounds.
Acknowledgments
The work is supported by The National Natural Science Foundation of China (U1401252, 61871188), The National Key R&D Program of China (2018YFC0309400), The Fundamental Research Funds for the Central Universities SCUT (2017MS062).
