Abstract
In the present study, we sought to enable instant tracking of the hand region as a region of interest (ROI) within the image range of a webcam, while also identifying specific hand gestures to facilitate the control of home appliances in smart homes or the issuing of commands in human-computer interaction applications. To accomplish this objective, we first applied skin color detection and noise processing to remove unnecessary background information from the captured image, before applying background subtraction to detect the ROI. Then, to prevent background objects or noise from influencing the ROI, we utilized the kernelized correlation filters (KCF) algorithm to track the detected ROI. Next, the ROI image was resized to 100×120 and input into a deep convolutional neural network (CNN) to enable the identification of various hand gestures. Two deep CNN architectures, modified from the AlexNet CNN and the VGGNet CNN, respectively, were developed by substantially reducing the number of network parameters used and appropriately adjusting the internal network configuration settings. The tracking and recognition process described above was then repeated continuously to achieve an immediate effect, with the system continuing to run until the hand was removed from the camera range. The results indicated excellent performance by both of the proposed deep CNN architectures. In particular, the modified version of the VGGNet CNN achieved better performance, with a recognition rate of 99.90% for the utilized training data set and a recognition rate of 95.61% for the utilized test data set, which indicates the good feasibility of the system for practical applications.
Introduction
The hand gesture recognition process typically consists of two phases, namely, detection and recognition, with the recognition phase being further divided into the identification of both static and dynamic hand gestures, where the term “static hand gesture” refers to a fixed hand gesture and the term “dynamic hand gesture” refers to a hand gesture involving continuous motion, such as a wave or a grab.
For the first phase, that is, the detection phase, the skin segmentation method is generally utilized to segment the background and the hand, after which noise processing and the background subtraction method are applied to acquire the region of interest (ROI), i.e., the hand region. Following the introduction of the Kinect [1] depth camera by Microsoft, a number of depth information-based hand detection methods have been developed in recent years, such as those introduced by Keskin et al. [2] and Memo et al. [3], which utilize the random forest [4], a machine learning method, to train the system to detect the skeletal structure of the hand. However, because the Kinect camera and similar cameras are relatively costly in comparison to standard web cameras, and because their efficacy tends to be particularly impacted by the light source in the given location, there remain a number of limitations to their application.
Meanwhile, the recognition phase of the overall hand gesture recognition process is essentially a matter of classification. Such classification is achieved by categorizing different hand gestures into various categories, with manually set decision criteria being used traditionally and trained classification models being used more typically in recent years. With the traditional approach, the convex hull [5, 6] of the hand is used, after performing skin segmentation, to accomplish the recognition, with recognition being determined according to the number of polygon edges produced by a hand gesture and with only the numbers 1 to 5 being recognizable. For example [7], using this method to control a robot arm allows very few variations, and the method can be negatively influenced by complex background interference. Furthermore, this approach requires that the full hand face the lens, but this is not possible when performing certain complex hand gestures. Recently, various researchers have used machine-learning methods in order to train models for classification, with the different models used including support vector machines [8], hidden Markov models [9], convolutional neural networks (CNNs) [10], and recurrent neural networks [11], among others. Of those approaches, CNNs have proven to be the most popular option in the field of recognition because they produce better results than the other options, mainly because a CNN can effectively obtain the necessary feature values from an input picture, in addition to being able to effectively learn the differences between different samples after being trained with a large number of samples. Previously, the development of CNNs was somewhat limited because of insufficient hardware computing speed. 
More recently, however, advances in semiconductor manufacturing have yielded substantial increases in the speed of graphics processing units, removing the hardware bottleneck and allowing CNNs, in turn, to develop rapidly into deep CNNs, which have achieved more favorable results than other approaches. Deep CNNs produce better results primarily because they can obtain the necessary feature values from the input images and, when trained with relatively large numbers of samples, can effectively discern the differences between different sets of samples. The AlexNet CNN [12], an object recognition network that won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [13], is the most iconic of the deep CNNs. Among the outstanding deep CNNs that emerged from the ILSVRC competitions held from 2012 to 2017, the AlexNet [12] and VGGNet [14] CNNs (particularly the latter) are more widely utilized and implemented in industry, since their network architectures are less complex and more stable than those of other deep CNNs while also achieving high recognition accuracy. For these reasons, these two network architectures are better suited for use with general, reasonably priced computing equipment to realize a variety of potential applications.
We have found, however, that if, during the standard hand gesture recognition process consisting of the detection plus recognition phases, any object or noise similar in color to the color of the hand’s skin is present in the background, then it is easy for interference to occur that leads to the wrong ROI being detected. So, in order to avoid such interference, the background selection must be confined to a specific block. For example, in the study by Han et al. [15], the background was restricted to a small desktop area, with the webcam being pointed downward toward the desktop. Therefore, we sought to ease this restriction in the present study by limiting our focus to the recognition of static hand gestures, such that the ROI of the moving object could be tracked as the system was running, with recognition being performed near instantaneously. If this aim could be achieved, then the recognition process would no longer be restricted to just a small space. Accordingly, a tracking mechanism was added in this study in between the two phases of detection and recognition in order to prevent problems caused by the presence of objects in the background with colors similar to the hand’s skin color while also allowing for the tracking of a hand that might be moving.
A schematic diagram of the overall hand gesture recognition concept proposed in this study is shown in Fig. 1. As indicated in the figure, the proposed approach is effectively a combination of three key phases, specifically, the hand detection, hand tracking, and hand recognition phases. For the first phase, the hand detection phase, skin segmentation is first applied to the input image in order to remove the extraneous background information, after which noise processing is applied to repair the small amounts of damage that are evident in some images. Lastly, the background subtraction method [16, 17] is applied to determine the ROI of the hand position. For the second phase, the hand tracking phase, the kernelized correlation filters (KCF) [18] algorithm, an algorithm that has been widely applied for image tracking purposes in recent years, is utilized as the basis for calculation. The key idea is to extract the ROI features of the target position in the first frame in order to train a model. Then, as each subsequent frame arrives, the trained model performs the necessary calculation to predict the new position. For the third phase, the hand recognition phase, a deep CNN is applied to extract and then recognize the hand features of the ROI. More specifically, two deep CNN architectures modified from the AlexNet CNN [12] and VGGNet CNN [14], respectively, were used in this study for the purpose of comparison. We found that, after completion of the model training, a recognition rate of more than 95% could be achieved for a set of test samples.

Schematic diagram of the proposed overall hand gesture recognition concept. After detecting the ROI in the first frame captured by the webcam, the camera initializes the tracking algorithm, and the size of the ROI block is adjusted for entry into the deep CNN for recognition, with the order of these steps indicated by the blue arrows. Thereafter, the new incoming frames continue to be tracked and recognized by the tracking algorithm (i.e., the hand detection step is skipped), as shown by the orange arrows.
This study makes several contributions, which may be summarized as follows:
1. The study proposes a novel hand recognition system in which three key processes, namely, hand detection, hand tracking, and hand recognition, are effectively combined.
2. For the recognition step in the overall process, the study introduces two newly designed deep CNNs that are modifications of two classic deep CNNs, that is, the AlexNet CNN [12] and the VGGNet CNN [14]. These new deep CNNs can achieve sufficient recognition accuracy while also decreasing the computational load (because the size of the network is effectively reduced), allowing instant tracking and recognition to be achieved. In particular, the modified form of the VGGNet CNN was found to achieve a 99.90% recognition rate for the utilized training set and a 95.61% recognition rate for the utilized test set. These results demonstrate the high feasibility of the modified VGGNet CNN for use in practical applications.
3. The potential of the proposed hand gesture recognition system for use in related applications such as home appliance control (in smart homes) or human-computer interaction is considerable.
The remaining sections of this paper can be summarized as follows. In Section 2, the hand detection method is described, while the hand tracking method is explained in Section 3. In Section 4, the designs of the two network architectures used for the deep CNN-based hand gesture recognition are detailed, while the experimental results for the proposed system are provided in Section 5. Finally, Section 6 provides the conclusion for the paper and discusses directions for related future research.
Hand detection
Figure 2 provides a schematic diagram of the hand detection process, which consists of three steps (i.e., skin segmentation, noise processing, and background subtraction).

Hand detection process.
During the first step, the position of the hand is captured. Uneven brightness is the factor most likely to affect the detection of the hand’s skin color. This is because images are highly sensitive to changes in light, and the color of an object may change significantly depending on the type of light source that is used. An image captured by a normal camera is represented in the RGB color space and is easily affected by light, hence our decision to convert images to the YCbCr [19] color space with the aim of reducing the effects of light. Because this color space separates brightness from chroma and its conversion formula is simple, the conversion executes quickly, making it well suited for application in a real-time system. Since the Y value represents brightness, we used only Cb and Cr for skin segmentation (with the ranges 77≤Cb≤127 and 130≤Cr≤165) in order to segment out unwanted background information. After the skin color of an image has been segmented, the image is binarized such that only the edges and shape of the skin-colored region are retained (as depicted in Fig. 3(b)).
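The thresholding described above can be illustrated with a minimal sketch. The Cb and Cr ranges are the ones quoted in the text; the RGB-to-YCbCr coefficients are the common full-range (JPEG-style) ones and are an assumption, since the exact conversion variant used is not stated. The function names are hypothetical.

```python
def rgb_to_cbcr(r, g, b):
    """Return the (Cb, Cr) chroma pair for one 8-bit RGB pixel.

    Assumption: full-range JPEG-style coefficients; Y is not computed
    because only the chroma components are used for segmentation.
    """
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr


def is_skin(r, g, b, cb_range=(77, 127), cr_range=(130, 165)):
    """Apply the Cb/Cr thresholds quoted in the text to one pixel."""
    cb, cr = rgb_to_cbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_mask(image):
    """Binarize an image given as rows of (R, G, B) tuples: 255 = skin, 0 = background."""
    return [[255 if is_skin(*pixel) else 0 for pixel in row] for row in image]


# Tiny 2x2 example: one skin-like tone, then pure blue, white, and pure green.
image = [
    [(224, 172, 150), (0, 0, 255)],
    [(255, 255, 255), (0, 255, 0)],
]
mask = skin_mask(image)
```

Only the skin-like pixel falls inside both chroma ranges; the white pixel is rejected because its chroma sits at the neutral value of 128, which illustrates why discarding Y makes the test less sensitive to brightness.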

(a) The input image, (b) the image obtained after skin segmentation has been applied to the input image, (c) the image obtained after noise processing has been applied to the preceding image, and (d) the ROI (contained in the green frame) identified after background subtraction has been applied to the preceding image.
Objects other than the hand may also have a color similar to that of the hand’s skin in an image. Another issue is that the skin-colored hand region often becomes fragmented due to light-related changes and the presence of shadows. To address these problems, the second step involves noise processing, with the intent of removing the noise and repairing some of the fragmentation found in the hand region (as depicted in Fig. 3(c)). The overall processing performed in this step includes smoothing the image using a Gaussian blur (to make the edges of the hand more visible) and executing erosion and dilation sub-processes (to achieve the effect of noise removal).
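As an illustration of the erosion and dilation sub-processes, the following sketch applies an erosion followed by a dilation (a morphological opening) with a 3×3 square structuring element to a small binary mask. The kernel size and the order of the two operations are assumptions, and the Gaussian blur is omitted for brevity.

```python
def _neighborhood(mask, row, col):
    """Yield the 3x3 neighborhood around (row, col); pixels outside the mask count as 0."""
    height, width = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            yield mask[r][c] if 0 <= r < height and 0 <= c < width else 0


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighborhood is set."""
    return [[1 if all(_neighborhood(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighborhood is set."""
    return [[1 if any(_neighborhood(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def open_mask(mask):
    """Erosion followed by dilation: removes specks smaller than the kernel."""
    return dilate(erode(mask))


# A 3x3 "hand" blob plus one isolated noise pixel in the top-left corner.
noisy = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = open_mask(noisy)
```

The isolated pixel does not survive the erosion, while the blob is restored to its original extent by the dilation. Reversing the order (dilation first) would instead close small holes, which is the operation that repairs fragmentation.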
After noise processing is complete, the system should be better able to focus on determining the skin-colored regions of an image. However, in practice, the system may still end up detecting objects with colors that are close to that of the skin, which is why the background subtraction method [16, 17] is performed as the third step to further remove unwanted objects from the background. This is achieved by comparing the value of every pixel point in a new image with the mean value of all pixel points in the previous image, and treating a point as part of the background if the difference does not exceed a certain range. Next, the connected region with the largest contour in the image is identified and framed to designate it as the initial region for tracking. Before tracking is performed, the system first determines whether the framed region is actually the hand region by using an appropriate region size as the threshold value. If the framed region is too large or too small compared with the threshold value, it is not recognized as the hand region, and the system performs another round of detection. If the size of the framed region is close to the threshold value, the region is recognized as the hand region, that is, the ROI (as shown in Fig. 3(d)). Finally, the ROI is tracked on a continuous basis through application of the tracking algorithm (which is explained in Section 3).
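The third step can be sketched as follows, under several assumptions: frames are grayscale grids, the background model is a single reference frame compared pixel by pixel, regions are 4-connected, and the size gate is expressed as a bounding-box area range. The threshold values and function names are hypothetical and are not taken from the study.

```python
from collections import deque


def foreground_mask(background, frame, threshold=25):
    """Mark pixels whose gray level differs from the background by more than threshold."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def largest_region_box(mask):
    """Return (top, left, bottom, right) of the largest 4-connected region, or None."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    best = []
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, pixels = deque([(r0, c0)]), []
            while queue:  # breadth-first flood fill of one region
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(pixels) > len(best):
                best = pixels
    if not best:
        return None
    rows = [r for r, _ in best]
    cols = [c for _, c in best]
    return min(rows), min(cols), max(rows), max(cols)


def is_hand_sized(box, min_area, max_area):
    """Accept the framed region only if its bounding-box area lies in the given range."""
    top, left, bottom, right = box
    area = (bottom - top + 1) * (right - left + 1)
    return min_area <= area <= max_area


background = [[10] * 6 for _ in range(5)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (1, 2, 3):
        frame[r][c] = 200          # a 2x3 moving object
frame[4][5] = 200                  # an isolated changed pixel
box = largest_region_box(foreground_mask(background, frame))
```

Here `box` frames the 2×3 object and ignores the isolated pixel; a call such as `is_hand_sized(box, 4, 12)` then decides whether the region is passed on to the tracker or detection is repeated.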
Hand tracking
Once the ROI has been obtained, the proposed system utilizes the KCF algorithm [18] for the tracking of the detected ROI, as this algorithm allows interference from moving objects or noise from background elements of a color similar to that of the hand’s skin to be avoided. In applying the algorithm, the tracking target (that is, the ROI) is utilized as the positive sample, while circulant matrix displacement is applied in order to produce multiple samples having the same size that then serve as negative samples. Together, both the positive sample and negative samples are used to train the model. Then, when a new frame is captured, a correction calculation is applied to the trained model to determine the (potentially altered) ROI position and accomplish the tracking effect. As shown in Fig. 4, the KCF algorithm consists of two stages, namely, the training stage and the tracking stage, with the training stage being the first stage (Fig. 4(a)).

Flow chart showing how the KCF algorithm performs the overall tracking process, where (a) depicts the training stage and (b) depicts the tracking stage.
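In the training stage, the ROI detected with the background subtraction method in the first frame of interest is used as the positive sample for the target tracking training. First, multiple additional training samples (specifically, negative samples) are generated from the positive sample, and the positive sample and each of the negative samples are used as inputs for training. The objective of training is to find a function $f(\mathbf{z}) = \mathbf{w}^{T}\mathbf{z}$ that minimizes the squared error over the samples $\mathbf{x}_i$ and their regression targets $y_i$ [18]:
$$\min_{\mathbf{w}} \sum_{i} \left( f(\mathbf{x}_i) - y_i \right)^2 + \lambda \lVert \mathbf{w} \rVert^2 \tag{1}$$
where $\lambda$ is a regularization parameter. When the inputs are mapped to a nonlinear feature space $\varphi(\mathbf{x})$ by means of the kernel trick, so that $\mathbf{w} = \sum_{i} \alpha_i \varphi(\mathbf{x}_i)$, the function in (1) becomes
$$f(\mathbf{z}) = \mathbf{w}^{T}\varphi(\mathbf{z}) = \sum_{i=1}^{n} \alpha_i \kappa(\mathbf{z}, \mathbf{x}_i) \tag{4}$$
In (4), $\kappa(\mathbf{x}, \mathbf{x}') = \varphi(\mathbf{x})^{T}\varphi(\mathbf{x}')$ is the kernel function. Note that the inner products between all pairs of the samples are stored in an $n \times n$ kernel matrix $\mathbf{K}$ with elements $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$.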
Next, the second stage of the algorithm, the tracking stage, is performed (Fig. 4(b)). In the tracking stage, when a new frame arrives, the position of the ROI within the preceding frame is used to capture an image of the ROI in the new frame, and displacements of the ROI are applied in order to generate different samples. The new frame and the newly generated samples are then input into the trained model produced in the first stage (Fig. 4(a)), with a regression calculation then being performed (i.e., via (4)-(6)) and the position of the maximum value being used to designate the updated ROI position. Once the new target position has been obtained, the tracking target image is captured again and the step depicted in Fig. 4(a) is repeated in order to further train and update the model, after which the system waits for the arrival of the next incoming frame and continues its tracking of the ROI (that is, the above process is repeated). Only when the hand is taken out of range of the camera will the system cease to run.
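The two stages can be illustrated with a deliberately simplified one-dimensional sketch. It assumes a linear kernel (the special case in which the kernelized filter reduces to a plain correlation filter), cyclic shifts of a single base sample as training samples, a Gaussian regression target, and a naive discrete Fourier transform; it is not the implementation used in the study, and all names and parameter values are hypothetical.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (sufficient for a short example)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def gaussian_target(n, sigma=1.0):
    """Regression targets: 1 for the unshifted sample, decaying with cyclic shift."""
    return [math.exp(-0.5 * (min(i, n - i) / sigma) ** 2) for i in range(n)]


def linear_kernel_correlation(x_hat, z_hat):
    """Fourier transform of the linear-kernel values between x and all cyclic shifts of z."""
    return [xk.conjugate() * zk for xk, zk in zip(x_hat, z_hat)]


def train(x, y, lam=1e-2):
    """Training stage: ridge regression over all cyclic shifts, solved per frequency."""
    x_hat = dft(x)
    kxx_hat = linear_kernel_correlation(x_hat, x_hat)
    return [yk / (kk + lam) for yk, kk in zip(dft(y), kxx_hat)]


def detect(alpha_hat, x, z):
    """Tracking stage: response of the trained model at every cyclic shift of z."""
    kxz_hat = linear_kernel_correlation(dft(x), dft(z))
    response = dft([a * k for a, k in zip(alpha_hat, kxz_hat)], inverse=True)
    return [value.real for value in response]


def cyclic_shift(x, t):
    """Move the content of x by t positions to the right, wrapping around."""
    n = len(x)
    return [x[(j - t) % n] for j in range(n)]


x = [0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # target appearance
alpha_hat = train(x, gaussian_target(len(x)))
z = cyclic_shift(x, 3)                                  # target moved by 3 samples
response = detect(alpha_hat, x, z)
peak = max(range(len(response)), key=response.__getitem__)
```

The index of the maximum response (`peak`) recovers the displacement of the target, which corresponds to using the position of the maximum value as the updated ROI position. In a full tracker, the model would then be retrained on the patch at the new position.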
Deep CNN-based hand gesture recognition
To define a consistent size for the input images (so that the number of neurons set for the fully connected layers can likewise be fixed), the proposed system selects the detected ROI using a width to height ratio of 4 to 5, resizes the selected ROI to 100×120, and then enters the resized ROI into the CNN so that it can accomplish real-time tracking and recognition. For the present study, two deep CNN architectures were designed, with Architecture 1 being a modified form of the AlexNet CNN [12] and Architecture 2 being a modified form of the VGGNet CNN [14]. Each of the two modifications was performed primarily in order to reduce the size of the resulting network while still allowing sufficient recognition accuracy to be achieved efficiently.
Architecture 1 (modified version of the AlexNet CNN)
Figure 5 shows the modified AlexNet architecture, while Fig. 6 depicts its internal parameters in detail.

Architecture 1 (modified version of the AlexNet CNN).

Detailed internal parameters of Architecture 1 (modified version of the AlexNet CNN).
Convolutional layers
In this architecture, four convolutional layers are used. However, the sizes and numbers of the convolution kernels differ from layer to layer, with the deeper layers utilizing larger numbers of kernels so that the system can capture more features. Specifically, the four layers, in sequence, use 32, 64, 64, and 128 convolution kernels, with kernel sizes of 5×5, 3×3, 3×3, and 3×3, respectively. To ensure that the size of the output feature map remains consistent, the zero-padding method is applied. That is, zeros are added around the input so that its size is maintained after convolution and the effect of the image edge is reduced. In addition, a rectified linear unit (ReLU) activation function is applied immediately after each convolutional layer.
Pooling layers
Once the feature map has been obtained by the convolution layer, down sampling is applied to reduce the size of the sampled image to 25% of its original size; in other words, the length of each edge of the image will be reduced to 50% of the given edge’s original size. Specifically, max-pooling was applied for sampling in this study, with a 2×2 kernel and a stride of 2 being used to obtain the maximum value of the internal elements within the image and the max-pooling being applied a total of four times, such that the picture size of the final input sent to the fully connected layer was 7×8. Moreover, each pooling layer had a layer of local response normalization (LRN) added to it.
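The down-sampling operation can be illustrated with a short sketch of 2×2 max-pooling with a stride of 2. It assumes that windows at the right and bottom edges are clipped when a dimension is odd, so that the output size is rounded up; this is an assumption, chosen because it reproduces the 7×8 size quoted above for a 100×120 input.

```python
def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2; edge windows are clipped, so odd sizes round up."""
    height, width = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c]
                 for r in range(top, min(top + 2, height))
                 for c in range(left, min(left + 2, width)))
             for left in range(0, width, 2)]
            for top in range(0, height, 2)]


small = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
pooled = max_pool_2x2(small)   # 3x3 input, so the last row and column use clipped windows

# A 100-wide, 120-high map pooled four times, as in Architecture 1.
fmap = [[0] * 100 for _ in range(120)]
for _ in range(4):
    fmap = max_pool_2x2(fmap)
final_size = (len(fmap[0]), len(fmap))   # (width, height)
```

The width shrinks as 100, 50, 25, 13, 7 and the height as 120, 60, 30, 15, 8, so the rounding at the odd sizes 25 and 15 is what yields 7×8 rather than 6×7.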
Fully connected layers
Two fully connected layers are used in this architecture, with the first layer containing 1024 neurons and the final output having six categories (that is, six common hand gestures), as depicted in Fig. 7. In addition, the dropout method [21, 22] is applied to the input of the fully connected layers in order to lessen the problem of over-fitting. In other words, during training, the output of each node is set to zero with a certain probability; a node dropped in this way is excluded from that training iteration, and its weights are not updated. When the network is tested, all of the nodes are used. This approach is very effective for training networks for which little training data is available.
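A minimal sketch of the dropout mechanism follows. It uses the "inverted" variant, in which the surviving values are rescaled during training so that nothing needs to change at test time; this variant and the drop probability of 0.5 are assumptions, as the study does not state which form or rate was used.

```python
import random


def dropout(values, drop_prob, training, rng=random):
    """Inverted dropout: zero each value with probability drop_prob while training.

    Surviving values are divided by the keep probability so that the expected
    sum is unchanged; at test time the input is passed through untouched.
    """
    if not training or drop_prob == 0.0:
        return list(values)
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)                       # fixed seed for a repeatable example
activations = [1.0] * 1000
train_out = dropout(activations, 0.5, training=True, rng=rng)
test_out = dropout(activations, 0.5, training=False)
```

During training, roughly half of the entries in `train_out` are zero and the rest are doubled, whereas `test_out` equals the input, matching the description that all nodes are used when the network is tested.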

Labels assigned for six common hand gestures and representative images of the respective hand gestures.
Training method
Figure 8 shows the training flowchart. Prior to the training, the weight parameters of each layer in the network are set randomly to initialize the network. Then, once a training image has been input, the initialized network performs its estimation, yielding the output result. This output is passed through the softmax function, after which the difference between the predicted result and the real label (that is, the error) is calculated by applying the cross-entropy loss function. The proposed system uses one-hot encoded labels for the output values (with the value for the correct category being set to 1 and all other values being set to 0), as depicted in Fig. 7. For example, if the input sample belongs to the first category, the label assigned to the sample is [1,0,0,0,0,0]. Once the value of the loss function has been calculated, the system applies a selected optimizer to update the weight values.
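The loss computation described above can be written out in a few lines. The logits in the example are hypothetical network outputs for the six gesture categories, not values from the study.

```python
import math


def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probabilities) if t > 0)


logits = [2.0, 1.0, 0.1, 0.0, -1.0, 0.5]     # hypothetical outputs for six gestures
label = [1, 0, 0, 0, 0, 0]                   # sample belongs to the first category
probs = softmax(logits)
loss = cross_entropy(probs, label)
```

With a one-hot label, the loss reduces to the negative logarithm of the probability assigned to the correct category, so it approaches zero as that probability approaches 1.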

Network training flowchart.
Architecture 2 (modified version of the VGGNet CNN)
Figure 9 shows the modified VGGNet architecture, while Fig. 10 depicts its internal parameters in detail.

Architecture 2 (modified version of the VGGNet CNN).

Detailed internal parameters of Architecture 2 (modified version of the VGGNet CNN).
Convolutional layers
In Architecture 2, a total of eight convolutional layers are used. As is the case in Architecture 1, more kernels are used by the deeper layers, which makes it possible for a greater range of features to be captured by the system. Specifically, the eight layers, in sequence, use 32, 64, 64, 64, 128, 128, 256, and 256 convolution kernels, with kernel sizes of 5×5, 3×3, 3×3, 3×3, 3×3, 3×3, 3×3, and 3×3, respectively. To ensure that the size of the output feature map remains consistent, the design of this architecture is similar to that of Architecture 1 described in Sec. 4.1.1, insofar as Architecture 2 uses zero-padding and ReLU layers in the same manner.
Pooling layers
Architecture 2 uses the max-pooling mechanism in the same manner as described for Architecture 1 in Sec. 4.1.2, except that for Architecture 2, max-pooling is applied a total of five times, such that the picture size of the final input sent to the fully connected layer is 4×4. Furthermore, no LRN layer is included after each pooling layer in Architecture 2.
Fully connected layers
Consistent with the description for Architecture 1 in Sec. 4.1.3, two fully connected layers are utilized in Architecture 2, with the input parameters being set to 1024 neurons and the final output having six categories. However, in contrast with the description of Architecture 1 provided in Sec. 4.1.3, in Architecture 2, the dropout mechanism is not applied prior to the two fully connected layers; rather, it is applied to the output of each of the two fully connected layers.
Training method
The training method used for Architecture 2 is the same as that used for Architecture 1, with the full description of the method being provided in Sec. 4.1.4.
Comparison and discussion
A comparison of Fig. 6 and Fig. 10 reveals significant differences between Architectures 1 and 2. More specifically, Architecture 2 (which exhibits VGGNet characteristics) primarily utilizes very small convolution kernels (3×3) to increase network depth; it also increases the number of convolution kernels that it uses as the network depth increases, so as to achieve better fine-grained recognition accuracy. Furthermore, Architecture 2 applies only ReLU (and no LRN) after its convolutional layers, whereas Architecture 1 additionally uses LRN, which increases memory consumption. A common characteristic shared by Architectures 1 and 2 is that fully connected layers are used at the end of the network to produce the final classification results.
The six common types of hand gestures shown in Fig. 7 are used as a case study to establish six output categories for both architectures and, thereby, enable the identification of these six types of hand gestures. The design of the two network architectures indicates that, by retraining their network parameters, they are both capable of identifying any type of hand gesture (not just the six types shown in Fig. 7, but also other, more complex hand gestures), as long as the total number of hand gesture types being analyzed does not exceed six. However, in the event that there are more than six types of hand gestures to analyze, the detailed design of the internal parameters of the deep CNN architecture may need to be re-evaluated. Furthermore, the type of hand gesture is very unlikely to affect the tracking results, as the KCF algorithm operates independently of the hand gesture being performed.
Experimental results
In this section, the results for the two proposed deep CNN architectures are compared and analyzed. Because the depths of the two architectures and their internal parameters are somewhat different, it was expected that their recognition rates would differ to some degree.
If preprocessing can be applied to lessen the amount of unnecessary information contained in the training data used to train the network, then the training of the network can be accomplished more quickly, which will in turn enable the network to produce better results with reduced complexity. Therefore, the proposed system applies skin color detection and smoothing to the original data, removing most of the background to obtain the training data used to train the model.
Based on our approach, as covered in detail in the previous sections, we have broken down the system’s overall operation flow into the steps as follows:
1. Capture the RGB image from a webcam and convert it from the RGB color space to the YCbCr color space.
2. Use Cb and Cr for skin segmentation and segment out unwanted background information.
3. Use the background subtraction method to determine whether a skin-colored region has entered the field of recognition. If not, go back to Step 1.
4. Determine whether the size of the detected region indicates a hand region. If not, go back to Step 1.
5. Apply the KCF algorithm to track the detected hand region.
6. Capture a region with a width to height ratio of 4 to 5 from the tracked region and resize it to 100×120.
7. Perform skin color detection and smoothing; remove the background and noise; and use the trained deep CNN to categorize the hand gesture being analyzed (that is, identify the hand gesture category that the hand gesture belongs to).
8. Repeat Steps 5 to 7 to perform hand gesture tracking and recognition.
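The control flow of these steps can be sketched as a loop over incoming frames. The four callables are hypothetical placeholders for the detection, tracking, and recognition components; only the order of operations is taken from the steps above, together with the stopping condition that the system ceases to run when the hand leaves the camera range.

```python
def run_pipeline(frames, detect_hand, start_tracker, classify):
    """Yield one gesture label per frame in which the hand is detected or tracked.

    Hypothetical interfaces (assumptions, not the study's API):
      detect_hand(frame)        -> ROI box or None      (Steps 1 to 4)
      start_tracker(frame, box) -> tracker object       (initializes Step 5)
      tracker.update(frame)     -> ROI box, or None once the hand has left
      classify(frame, box)      -> gesture label        (Steps 6 and 7)
    """
    tracker = None
    for frame in frames:
        if tracker is None:
            box = detect_hand(frame)
            if box is None:
                continue                      # no hand-sized region yet: back to Step 1
            tracker = start_tracker(frame, box)
        else:
            box = tracker.update(frame)       # detection is skipped once tracking starts
            if box is None:
                return                        # hand left the camera range: stop
        yield classify(frame, box)
```

Writing the loop as a generator keeps the recognition result for each frame available to the caller as soon as it is produced, which suits a real-time setting.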
Training data
800 training images of each hand gesture were collected (meaning that, because there were six hand gestures of interest, a total of 4800 images were used to train the model), with each of those images having different angles and backgrounds (as depicted in Fig. 11). Moreover, as shown in Fig. 12, there could still, in some cases, be some amount of background information that could not be removed by the skin segmentation process. Therefore, the different backgrounds of the collected training images were also used for the training, so that the model could completely and accurately learn the required features in order to ignore extraneous information. Lastly, the model was verified through the use of 300 test images.

Examples of original training images, including information resulting from different backgrounds and angles, before preprocessing.

Examples of training images showing how some noise could not be removed by the skin segmentation process.
The network parameters and training results of Architecture 1 are presented in Tables 1 and 2, respectively. It should be noted that, after each image was pre-processed, the original pixel values of 0∼255 were normalized to −0.5∼0.5. We applied the adaptive moment estimation (Adam) algorithm to update the weight values, since this approach allows the model to be trained relatively efficiently [23].
Architecture 1 network parameter settings
Architecture 1 training results
The network parameters and training results of Architecture 2 are listed in Tables 3 and 4, respectively. Based on the study by Krizhevsky et al. [12], each image was preprocessed by subtracting 103.939, 116.779, and 123.68 from its three channels (i.e., RGB), respectively; these values are the per-channel mean pixel values computed over the training images in the ImageNet database.
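The two preprocessing schemes can be contrasted in a short sketch. The linear mapping used for the −0.5∼0.5 normalization is an assumption (the exact formula is not given), and the channel means are applied in the order in which they are listed in the text.

```python
def normalize_centered(pixel):
    """Map an 8-bit value in 0..255 linearly to -0.5..0.5 (Architecture 1).

    Assumption: a plain linear rescale; the study does not give the formula.
    """
    return pixel / 255.0 - 0.5


CHANNEL_MEANS = (103.939, 116.779, 123.68)   # values quoted in the text


def subtract_channel_means(pixel, means=CHANNEL_MEANS):
    """Subtract one mean per channel from a three-channel pixel (Architecture 2)."""
    return tuple(value - mean for value, mean in zip(pixel, means))


darkest = normalize_centered(0)
brightest = normalize_centered(255)
centered = subtract_channel_means((103.939, 116.779, 123.68))
```

Both schemes center the input around zero; the first also rescales it to a unit-width range, while the second keeps the original 8-bit scale.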
Architecture 2 network parameter settings
Architecture 2 training results
As can be seen from the results presented in Tables 2 and 4, the model’s recognition accuracy was increased by the deeper network and multiple convolutions of Architecture 2, which enabled a recognition rate of 95.61% to be achieved for the test set. In addition, the data presented in Table 5 clearly show that, in comparison to Architecture 1, fewer parameters are contained in the first fully connected layer of Architecture 2, and hence Architecture 2’s network parameters require less storage space. Moreover, the deeper network of Architecture 2 can obtain better features with an even smaller computational burden. For Architecture 2, we further compared the accuracy of hand gesture recognition for two scenarios, with the first involving the use of 256 convolution kernels for the convolutional layers Conv5_1 and Conv5_2, and the second involving the use of 128 convolution kernels for the same two layers. As shown in Table 6, the results indicate that this change in parameter value has a certain effect on the accuracy of gesture recognition, leading to a difference of close to 2%. Although the gesture recognition rate is higher (95.61%) when 256 convolution kernels are used, a higher computational load is also observed. On the other hand, the gesture recognition rate is only slightly lower (93.74%) when 128 convolution kernels are used. The balance between recognition accuracy and computational load may have to be determined based on actual application needs or requirements.
Comparison of the two architectures in terms of parameter quantity and required storage space for parameters
Comparison of the recognition rate for two Architecture 2 scenarios, with Scenario 1 involving the use of 256 convolution kernels for the Conv5_1 and Conv5_2; and Scenario 2 involving the use of 128 convolution kernels for the Conv5_1 and Conv5_2
To further highlight the advantages of the deep CNNs that we proposed, we selected the AlexNet CNN [12] and the VGGNet CNN [14] as a control group for comparison. VGGNet CNNs can be categorized based on the number of weight layers that they utilize. Typically, a VGGNet CNN is either a VGGNet-16 CNN, which comprises 13 convolutional layers and three fully connected layers (13 + 3 = 16), or a VGGNet-19 CNN, which comprises 16 convolutional layers and three fully connected layers (16 + 3 = 19) [14]. In other words, the control group for this comparison consists of three networks, namely the AlexNet CNN, VGGNet-16 CNN, and VGGNet-19 CNN. It should be noted that Step 6 of our system’s overall operation flow has to be adjusted to accommodate the input size of these control group networks. In particular, detected/tracked region image inputs need to be resized to 227×227×3 and 224×224×3 for the AlexNet CNN and the VGGNet-16/VGGNet-19 CNN, respectively. Table 7 lists the respective recognition rates for these three deep CNNs. The test set results indicate significantly better recognition rates for the VGGNet-16 and VGGNet-19 CNNs when compared to the AlexNet CNN. A comparison of the results in Tables 2 and 4 with those in Table 7 reveals that the recognition rate for the AlexNet CNN (85.01%) is only slightly higher than that of Architecture 1 (84.99%), while the recognition rates for the VGGNet-16 CNN (95.93%) and the VGGNet-19 CNN (96.05%) are only slightly higher than that of Architecture 2 (95.61%). In all cases, the difference is less than 0.5%. Looking at the respective parameters utilized by these five deep CNNs (as shown in Table 8), it is clear that the AlexNet CNN, the VGGNet-16 CNN, and the VGGNet-19 CNN utilize significantly more parameters than Architecture 1 and Architecture 2, which highlights the effectiveness of our proposed architectures.
Compared to the deep CNNs that we proposed, the VGGNet-16 CNN and the VGGNet-19 CNN utilize at least 15 times as many parameters and, consequently, impose a higher computational load. That is to say, we substantially reduced the size and depth of the classic networks while still attaining a similar recognition rate. This was achieved by utilizing a sufficient number of layers and adjusting the size of the feature maps and the number of kernels for each layer. Tables 4 and 7 also show that Architecture 2’s recognition rate (95.61%) is significantly better than that of the AlexNet CNN (85.01%), primarily because Architecture 2 has inherited the advantages of the VGGNet CNN. As can be observed in Table 8, Architecture 1 utilizes more parameters than Architecture 2. This can be primarily attributed to the fact that Architecture 1 utilizes more parameters at the first fully connected layer than Architecture 2 does (7341056 and 4195328 for Architecture 1 and Architecture 2, respectively, as shown in Table 5), and this layer accounts for a significant proportion of the total number of parameters used in a network.
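The two first-layer counts can be reproduced with the standard formula for a fully connected layer, namely inputs × outputs weights plus one bias per output. The sketch below is illustrative only. The layer shapes (7168 → 1024 and 4096 → 1024) are not stated in this section; they are an assumption, chosen because they are one plausible factorization that matches the quoted totals exactly, and the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameter count of a fully connected layer: weights plus one bias per output."""
    return n_in * n_out + n_out


# ASSUMPTION: shapes inferred from the quoted totals, not taken from the paper.
arch1_fc1 = dense_params(7168, 1024)  # Architecture 1, first fully connected layer
arch2_fc1 = dense_params(4096, 1024)  # Architecture 2, first fully connected layer

print(arch1_fc1, arch2_fc1, arch1_fc1 - arch2_fc1)
```

Under this assumption, the difference between the two architectures at this layer comes entirely from the size of the flattened feature map that feeds it, which is consistent with the explanation that feature map sizes and kernel counts were adjusted per layer.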
Comparison of the recognition rates for the AlexNet CNN [12], VGGNet-16 CNN [14], and VGGNet-19 CNN [14], with 16 and 19 indicating the total number of weight layers found in the VGGNet-16 CNN (13 convolutional layers + 3 fully connected layers = 16 layers) and the VGGNet-19 CNN (16 convolutional layers + 3 fully connected layers = 19 layers), respectively
Comparison of the number of parameters (rounded off to the nearest million) used by each deep CNN
Overall, the test results demonstrate the effectiveness of the proposed hand gesture recognition system in performing instant tracking and recognition of hand gestures, particularly when Architecture 2, which offers more advantages than Architecture 1, is used.
In the present study, traditional image processing methods were combined with a tracking method and with modified versions of two deep CNNs, and the resulting system provides high recognition accuracy with a reasonable computational load. Three key processes, namely, hand detection, hand tracking, and hand recognition, are combined in the proposed overall hand gesture recognition system. To achieve hand detection, the proposed system utilizes skin segmentation, noise processing, and background subtraction to detect the ROI within the first frame captured by the webcam. To accomplish hand tracking, it then uses this initially detected ROI to determine the initial position for tracking and to train the initial model through the application of the KCF algorithm. When the next frame is input, tracking is achieved by evaluating the regression model on that frame, taking the position of the maximum response value as the new position of the hand, and then updating the model. Finally, to perform hand gesture recognition, the two proposed deep CNNs (specifically, the modified versions of the AlexNet CNN and VGGNet CNN) are each separately used by the proposed system to extract and recognize the hand features of the ROI. The experimental data demonstrated the excellent performance of the modified versions of the AlexNet CNN and VGGNet CNN; that is, they allow for high recognition rates while using significantly fewer parameters than the original AlexNet CNN and VGGNet CNN. A noteworthy finding is that the modified version of the VGGNet CNN achieved recognition rates of 99.90% and 95.61% for the training set and the test set, respectively, while using the lowest number of parameters out of all the deep CNNs that were tested (i.e., the proposed and control group deep CNNs).
Given these and the other results presented herein, the proposed hand gesture recognition system has been demonstrated to be highly feasible for various practical applications (particularly when the system utilizes the modified version of the VGGNet CNN), such as the control of home appliances (in smart homes) or the facilitation of human-computer interactions.
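The overall control flow summarized in this conclusion (detect the ROI once, then repeatedly track and recognize until the hand leaves the camera range) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: `detect`, `start_tracker`, and `classify` are hypothetical callables standing in for the skin-color/background-subtraction detector, the KCF tracker, and the deep CNN (including its resizing step), respectively, and the convention that a tracker returns `None` when the hand is lost is our own.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of the hand ROI


def run_gesture_loop(
    frames: Iterable,
    detect: Callable[[object], Optional[Box]],
    start_tracker: Callable[[object, Box], Callable[[object], Optional[Box]]],
    classify: Callable[[object, Box], str],
) -> Iterator[str]:
    """Yield one gesture label per frame in which a hand ROI is available."""
    track = None
    for frame in frames:
        if track is None:
            # Detection phase: look for the initial ROI.
            box = detect(frame)
            if box is None:
                continue  # no hand yet, wait for the next frame
            # Initialize the tracker (hypothetically, KCF) on the detected ROI.
            track = start_tracker(frame, box)
        else:
            # Tracking phase: update the ROI position in the new frame.
            box = track(frame)
            if box is None:
                return  # hand left the camera range, stop the system
        # Recognition phase: classify the detected or tracked ROI.
        yield classify(frame, box)
```

Keeping the three stages behind plain callables is a design choice made for this sketch only; it lets either of the two proposed CNNs be swapped in as `classify` without touching the detection and tracking logic.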
In future studies, we plan to use depth detection networks, rather than the current image processing steps, to capture the ROI from images for detection and tracking. Doing so, however, will require a very large labeled database for the hand region, as well as speed and recognition accuracy levels sufficient for instant applications. Another direction for future research is the use of depth detection networks to detect the skeletal structure of the hand and determine its location directly from an input image, with the objective of improving the accuracy with which hand movements are recognized. Generating labeled training data for the skeletal structure of the hand will, however, pose challenges in terms of cost and time. Nonetheless, we believe that these objectives are worthwhile and are likely to be among the trends of future research.
