Abstract
This paper reports on work that explores natural gesture inputs from end-users for television use in a typical living room setting. First, we derive a set of 19 user-defined freehand gestures for regular TV control tasks. This study helps us to determine the gestures preferred by TV viewers and reduce the risk of developing an unusable, ineffective system. Then, based on this user-defined gesture set, we propose a unified framework to address specific problems in a complex real-world TV viewing environment, including 1) the automatic exclusion of many meaningless daily actions by TV viewers, 2) the capability to recognize both static and dynamic gestures simultaneously, as well as one- and two-handed gestures simultaneously, and 3) the continuous recognition of multiple dynamic gestures in the air (e.g., a channel switching gesture for channel 127). Experimental results show that our approach allows users to interact with TV-based applications more flexibly and effectively, with improved user experience and user satisfaction. Finally, we highlight the implications of our work for the design and development of related freehand gesture applications.
Introduction
With 27 degrees of freedom (DOF), human hands are used to perform various tasks, simple or complex, in real life. Rapid technical developments in areas such as natural human–computer interaction (HCI), sensor technologies (e.g., time-of-flight camera, Kinect and Leap Motion), and motion capture techniques have brought freehand-gesture-based interaction into application domains, including virtual/augmented reality [7], mobile computing [31,36], children’s games [13], ubiquitous computing [45], and intelligent robotics [14]. Freehand refers to human beings’ natural hands. Freehand-gesture-based interaction is a method of interaction using only bare hands for HCI input; it can potentially deliver natural, intuitive, and terse but powerful human–computer interaction techniques.
In addition to PC-based applications, the smart home is one area where freehand-gesture-based interaction becomes important. As our homes are equipped with an increasing number of computational devices, supporting more natural user interaction with these devices becomes a challenge. Currently, network-based applications and on-demand TV programs are delivered directly to an interactive digital TV. Compared with traditional TV tasks, such as channel switching, interactive tasks in a digital TV, such as entering characters to search contents and navigating through content space, are more complex and time-consuming. Specifically, using a traditional remote control often leads to low input efficiency and usability problems, such as eye-hand separation, i.e., users need to shift their attention back and forth between the TV screen and the remote control.
To support more natural user interaction with TV-based applications and contents, freehand-gesture-based designs have been explored [8,18,21,32,33,39,45]. Most of these designs can help users focus their eyes on the TV screen while performing various command gestures in the air. While freehand gesture inputs can provide increased variety and flexibility, current approaches continue to face challenges, such as robustness of gesture recognition, capacity to address different types of gestures, continuous recognition of multiple dynamic gestures in the air, and usability of freehand-gesture-based interaction techniques in practice. These challenges have limited the broad application of freehand-gesture-based designs in TV-based applications.
To address some of these challenges, we created a living room environment and then asked users to interact with a television using various gestures. Unlike existing gestural TV systems [8,21,32], which do not allow users to create their own gestures, we propose allowing users to create their own user-friendly gestures that will maximize their comfort and ease of use. Participatory design has long been considered an effective method for understanding end-users’ mental models and behaviors with new interaction techniques at the early stage of system design. With this method, system designers can obtain valuable information to shape a product’s characteristics for more effective and efficient use. Gesture elicitation, a technique that has emerged from the field of participatory design, has attracted increasing attention and been widely used to collect end-users’ requirements and expectations regarding the target system by involving end-users in gesture design processes. Based on the results of a gesture elicitation study, we propose a framework for recognition of the whole set of user-defined gestures. We highlight the robustness and high recognition accuracy of our framework for continuous recognition of different types of gestures.
The remainder of this paper is structured as follows: Section 2 reviews related work. Section 3 presents the system requirements and the user-defined gesture set. Section 4 proposes a unified framework for the recognition of user-defined gestures for interactive digital TV and describes some key technologies in the framework. Section 5 describes two experiments and a comparative analysis to study the performance of freehand gestures and presents the experimental results. Section 6 discusses the contributions of our research, its limitations, and possible future research directions.
Related work
Our work primarily concerns research related to the participatory design of freehand gestures and development of gesture-based prototypes for HCI. These two aspects of our research will be considered through the literature review.
Freehand-gesture-based systems
Existing work on freehand gesture recognition can be roughly divided into two classes – static gesture recognition [7,16,30] and dynamic gesture recognition [17,20,40–42,44]. A static gesture only concerns static features, such as hand shape, size, and color, while a dynamic gesture involves dynamic spatiotemporal information, such as movement speed, direction, position, and duration of the gesture. By using computer vision and HCI techniques, both static and dynamic gestures can be mapped directly to different computer commands.
Freehand gestures have been used in various ways. Some applications simply used gestures as a “natural mouse” for such tasks as pointing and drawing [24], while other applications used gestures for more complex activities, such as navigation and object manipulation in a virtual environment [7,16] and on interactive surfaces [12].
Beyond PC-based applications, freehand gestures are also widely used in smart homes for controlling televisions, air-conditioners, etc. For example, Freeman et al. [8] created a prototype to replace a TV remote control with freehand gestures. However, this system was developed based on the window, icon, menu, pointing device (WIMP) paradigm, and still treated freehand gestures as mouse commands. The rapid development of depth sensing technologies [11,29] in recent years allows users to move beyond the traditional WIMP paradigm and desktop metaphor and to involve users in a more flexible, creative and intuitive way regarding TV-based applications by translating freehand gestures directly into interaction commands. For example, Zaiţi et al. [45], Takahashi et al. [32], and Lee et al. [21] explored the use of freehand gestures to interact with a TV set by using a time-of-flight (TOF) camera, Kinect, and Leap Motion, respectively.
However, the gestures in those systems were designed not by end-users, but by experts with significant technical experience in computer vision, image processing, and HCI. End-users have few opportunities to participate in gesture design. Such practices may lead to a disagreement between user gestures imagined by designers and actual user gestures. Similar to the vocabulary problem [9] that affects the performance of information retrieval systems, this gesture disagreement problem may lead to a risk of poor system usability and low user acceptance. To develop a user-friendly gesture system, we must carefully evaluate the types of gestures that are most commonly used and structure our design principles accordingly.
Participatory design of freehand gestures
To gain a better understanding of these issues, some researchers used participatory design methods to study user behaviors. For example, the Wizard of Oz (WOZ) method was used by Höysniemi et al. [13] to study favorite gestures for small children playing computer games. Wobbrock et al. [37] developed a user-defined gesture set for surface computing based on the results of a guessability study. Such participatory design approaches have also been used for mobile interaction [31], music control at home [23], and TV control in a smart room [33,34,45]. Research has provided empirical evidence on the benefits of practices involving end-users in gesture design. For example, by comparing user-authored surface gestures and researcher-authored surface gestures, Morris et al. [26] found that researcher-authored gestures are less memorable and discoverable than user-authored gestures.
Different from those pure participatory design methods mentioned above, another approach for freehand gesture design is to involve end-users in multiple steps of the gesture design process. For example, Löcken et al. [23] and Nielsen et al. [27] proposed eliciting multiple candidate gesture sets from end-users, rather than only one in the early stage of gesture design. Then, a benchmark test was used in the subsequent process to compare and validate the usability of different gesture candidates and refine the results. Under this approach, additional feedback from end-users can be collected and the risk of rejecting promising gesture candidates can be reduced.
However, these studies generally stopped at the stage of gesture design and definition and lack further validation of gesture recognition and user preferences in practice. To improve the research in this direction, we conducted a study of natural gesture inputs for TV-based interaction tasks. The aim of this study is to explore various problems in the process of design, implementation, and evaluation of freehand gesture systems, specifically those concerning such usability problems as
What are the most commonly used gesture techniques for TV commands such as volume control and channel changing? What is the best way to design and develop such gesture techniques for TV systems? Can those gestures work effectively in real-world scenarios? Do users prefer those gestures for TV-based applications?
Requirement analysis and user-defined gestures
To develop a user-friendly gesture application, it is necessary to have a clear understanding of the usage context of the intended system. However, most existing research on gesture design and development focuses on concrete gesture recognition algorithms; few studies are concerned with gestural design principles and design specifications in practice. In contrast to prior work [8,18,21,32,45], we conducted a multi-stage participatory design study in which actual TV viewers were asked to work with expert designers to design gestures for a television system.
During the first stage, we recruited 24 participants (10 males and 14 females) from a university. Their ages were between 18 and 45 (
User-defined gesture set for a smart TV system
Previous elicitation studies [23,27,31,33,34,37,45] required participants to design only a single gesture for each target task. However, these methods often suffer from the legacy bias problem [25], which refers to the phenomenon that end-users’ gesture proposals are often biased by their experience with prior interfaces and technologies such as WIMP interfaces (e.g., a mouse) or touch-based interfaces (e.g., an iPhone). To reduce the impact of legacy bias, during the second stage we conducted an improved gesture elicitation study inspired by Morris’ suggestion [25]. During this stage, the same 24 participants were asked to design three gestures rather than one for each of the 19 commonly seen TV tasks for daily use based on their preferences. By doing this, we hoped to derive a more reliable gesture set for TV control than traditional gesture elicitation studies.
Before the study, we established an environment that mimicked a living room in a usability lab. Then, participants sat on a sofa and were shown a list of all 19 tasks on the TV screen. We used the “think-aloud” method to gather the meaning of participants’ gestures. All participants were told that they were interacting directly with the TV using a gesture recognition system. They were asked to say aloud which task they were performing. We adopted a Wizard of Oz approach in responding to participants’ gestures: an experimenter listened to what a participant said when performing a task and used a regular TV remote control to produce the result of the task. We used five cameras to capture participants’ intuitive gestures from different perspectives for later data analysis. As a result, we collected a total of 1368 gesture candidates (
Before the third stage, we conducted a brainstorming session in which 4 HCI researchers were asked to merge and group gesture candidates for each TV function. For gestures with the exact same hand shape and/or same motion trajectory, researchers grouped them into a single gesture directly. For gestures with similar characteristics in hand shape and/or motion trajectory, researchers replayed the corresponding video files we captured during the elicitation process and discussed whether and how to group gestures based on participants’ verbal explanations. For example, 19 swipe right gestures with different hand shapes for “Next Channel” can be merged into a single gesture. As a result, we obtained 252 groups of identical gestures for 19 TV tasks. During the third stage, we recruited another 24 participants (12 males and 12 females) with an average age of 31.34 years (
As shown in Table 1, the user-defined gesture set involves three static gestures (Confirm, Mute, and Menu) and 16 complex dynamic gestures. Compared with simple dynamic gestures that only concern hand motion trajectories [17,20,42–44], the complex dynamic gestures in our work involve both an initial presence of a static hand posture and the following spatiotemporal trajectories of hand movement, i.e., complex dynamic gestures assume the presence of a sequence of hold-movement-hold. For example, the dynamic gesture ‘0’ is recognized as a combination of an initial ‘index finger pointing up’ hand posture and a following dynamic gesture to draw a circular trajectory with the index finger. In addition, the user-defined gesture set involves both one- and two-handed gestures simultaneously. The three two-handed gestures are Mute, Turning on the TV, and Turning off the TV; the remainder are one-handed gestures.
In this section, end-users were asked to design gestures freely according to their preferences. The user-defined gesture set established the foundation for the remaining system development process.
From the participatory design study, we learned which gestures end-users prefer and which gestures a gestural TV system should provide. To recognize the proposed 19 user-defined gestures for TV control tasks in a natural human-machine environment, our system must meet the following requirements:
Recognize both static and complex dynamic gestures simultaneously;
Recognize both one- and two-handed gestures simultaneously;
Distinguish between the users’ meaningful gestures and their meaningless actions, e.g., a meaningful Previous Channel gesture and a meaningless waving hand action; and
Recognize multiple continuous dynamic gestures in the air, e.g., a channel switching gesture for channel 127.

System architecture.

The mean shift procedure
Based on the above requirements, we propose a unified freehand gesture recognition framework. The proposed framework, as shown in Fig. 1, consists of three main functional modules: hand segmentation, feature extraction, and gesture recognition; each addresses a specific problem. The hand segmentation module receives live video feeds from a contactless depth sensor and sends segmented binary hand images to the feature extraction module. In the feature extraction module, the visual features are extracted and represented by using density distribution features and vector quantization techniques; they are then processed by the gesture recognition module. For dynamic gesture recognition, the system must execute the hand posture and hand motion trajectory recognition modules simultaneously and switch from hand posture to hand motion trajectory recognition by using a simple motion-activated mechanism that uses two cutoff thresholds in a twin-comparison strategy [46]. To identify key gestures and reject meaningless actions, we propose a new filter model and incorporate it into the gesture recognition module, which in turn can recognize both one- and two-handed gestures simultaneously. This helps to classify static and dynamic gestures in a unified framework. We also provide a mapping table for users to define their personal mappings between gestures and TV control commands.
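The motion-activated switch between posture and trajectory recognition can be illustrated with a minimal sketch. It assumes a hysteresis reading of the two cutoff thresholds: movement starts when the per-frame hand displacement rises above a high threshold and ends when it falls below a low one. The function name `segment_motion`, the threshold names, and the sample values are hypothetical and are not taken from the paper.

```python
from math import dist

def segment_motion(centroids, t_low, t_high):
    """Label each frame as 'posture' or 'trajectory' from hand displacement.

    Assumption (not from the paper): movement starts when the per-frame
    displacement exceeds t_high and ends when it drops below t_low.
    """
    labels = ["posture"]  # the first frame has no displacement yet
    moving = False
    for prev, cur in zip(centroids, centroids[1:]):
        speed = dist(prev, cur)  # displacement between consecutive frames
        if not moving and speed > t_high:
            moving = True        # hand started to move: track the trajectory
        elif moving and speed < t_low:
            moving = False       # hand came to rest: back to posture matching
        labels.append("trajectory" if moving else "posture")
    return labels

# Made-up 3D hand centroids: hold, move, hold.
frames = [(0, 0, 0), (0.1, 0, 0), (3, 0, 0), (5, 0, 0), (5.8, 0, 0), (5.9, 0, 0)]
labels = segment_motion(frames, t_low=0.5, t_high=2.0)
```

Using two thresholds rather than one keeps the label from flickering when the hand speed hovers around a single cutoff.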
Mean shift is a robust non-parametric clustering technique that has been widely used for image segmentation and feature space analysis [5]. Because it does not require prior knowledge of the number of clusters, we use a mean shift procedure to segment human hand(s) from other objects in a complex TV viewing environment. The algorithm translates the kernel window with the profile
After the mean shift segmentation, some noise may remain in the binary image. The fuzzy set theory and pyramid method in morphology [47] are introduced to filter and enhance the binary image. Next, a circle fitting procedure [28] is used to detect the palm by estimating the largest fitting circle in the binary image (Fig. 2b). Similar to the mean shift algorithm, the circle fitting procedure is iteratively repeated until it isolates the palm from the remainder of the forearm (Fig. 2c).
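For readers unfamiliar with the mean shift step used in the segmentation stage, the following sketch shows the basic mode-seeking iteration. It assumes a flat (uniform) kernel and plain point tuples; the kernel profile, bandwidth, and feature space actually used in our system are not reproduced here, and `mean_shift_mode` is a hypothetical helper name.

```python
def mean_shift_mode(points, start, bandwidth, max_iter=100, tol=1e-6):
    """Move `start` toward a nearby density mode using a flat kernel.

    Assumption: a uniform kernel of radius `bandwidth`; other kernel
    profiles would weight the points inside the window differently.
    """
    center = tuple(start)
    for _ in range(max_iter):
        # Collect the points that fall inside the current kernel window.
        window = [p for p in points
                  if sum((a - b) ** 2 for a, b in zip(p, center)) <= bandwidth ** 2]
        if not window:
            break
        # Shift the window to the mean of the points it contains.
        new_center = tuple(sum(c) / len(window) for c in zip(*window))
        shift = sum((a - b) ** 2 for a, b in zip(new_center, center)) ** 0.5
        center = new_center
        if shift < tol:
            break
    return center

# Two made-up clusters; each start point converges to its own cluster mode.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
mode_a = mean_shift_mode(pts, start=(0, 0), bandwidth=3)
mode_b = mean_shift_mode(pts, start=(11, 11), bandwidth=3)
```

Pixels whose windows converge to the same mode can then be grouped into one segment, which is why the number of clusters does not need to be fixed in advance.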

Different stages of the hand segmentation process: a) a binary image containing a roughly segmented hand by running the mean shift segmentation procedure; b) palm detection by running the circle fitting procedure; c) refined hand region; and d) an equidistance division of the hand region.
Static feature extraction
As shown in Fig. 2, the hand shape is represented by white pixels distributed in different spatial regions of a binary image. Therefore, we can judge whether two binary images are similar by seeing if they have a similar regional distribution of white pixels. In this section, we propose a density distribution feature (DDF) method to describe the hand region in a binary image. Given a binary image, the DDF of the hand region is defined as follows:

The calculation of DDF
As shown above, the DDF of a hand region is invariant to translation, rotation, and scaling.
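To make the idea of a density distribution feature concrete, the sketch below computes one possible variant: the fraction of foreground pixels that fall into each of several equal-width rings centered on the hand centroid. This is an illustrative assumption based on the equidistance division in Fig. 2d, not the exact DDF definition used in this paper; the function name and the number of rings are hypothetical.

```python
from math import hypot

def density_distribution_feature(image, rings=4):
    """Fraction of foreground pixels in each of `rings` equal-width rings.

    Assumption: rings are centered on the foreground centroid and scaled
    by the largest centroid-to-pixel distance (a hypothetical DDF variant).
    """
    pixels = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value]
    if not pixels:
        return [0.0] * rings
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    radii = [hypot(x - cx, y - cy) for x, y in pixels]
    r_max = max(radii) or 1.0  # guard against a single-pixel region
    counts = [0] * rings
    for r in radii:
        index = min(int(r / r_max * rings), rings - 1)
        counts[index] += 1
    return [c / len(pixels) for c in counts]

# The same 3x3 blob at two positions yields the same feature vector.
blob = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
shifted = [[0] * 6] + [[0] + row for row in blob]
feature = density_distribution_feature(blob)
feature_shifted = density_distribution_feature(shifted)
```

Measuring distances from the centroid gives translation invariance, using rings rather than sectors gives rotation invariance, and dividing by the maximum radius and the pixel count gives scale invariance, at least up to discretization effects.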
Conventional approaches for dynamic gesture recognition typically rely on 2D hand motion trajectories [20,28,43,44]. Due to the loss of feature information, systems developed using these methods suffer from the degradation of recognition accuracy. Compared to prior work, we directly extract 3D feature points from the input frames. Then, in the tracking phase, we connect all the centroid points of the hand region to produce a 3D hand motion trajectory.
Let
Based on the proposed procedure, a 3D motion trajectory is characterized by a feature vector set

The extended LBG algorithm
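As background for the vector quantization step, here is a minimal sketch of the standard LBG (Linde-Buzo-Gray) codebook construction: start from the global mean, repeatedly split every codeword, and refine with nearest-neighbor reassignment. It is the textbook algorithm, not the extended variant used in this paper, and it assumes the target codebook size is a power of two. All function names and sample vectors are hypothetical.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))

def lbg_codebook(vectors, size, eps=0.01, iters=20):
    """Standard LBG: split every codeword, then refine (size = power of two)."""
    codebook = [mean(vectors)]
    while len(codebook) < size:
        # Split each codeword into two slightly perturbed copies.
        codebook = [tuple(c + s * eps for c in cw)
                    for cw in codebook for s in (1, -1)]
        for _ in range(iters):
            # Assign every vector to its nearest codeword, then recenter.
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(codebook, v)].append(v)
            codebook = [mean(cell) if cell else cw
                        for cw, cell in zip(codebook, cells)]
    return codebook

def quantize(trajectory, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    return [nearest(codebook, v) for v in trajectory]

# Made-up 2D feature vectors forming two groups.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
codebook = lbg_codebook(data, size=2)
symbols = quantize([(0.2, 0.1), (10, 10.4), (11, 11)], codebook)
```

The resulting symbol sequence is the kind of discrete observation sequence that a discrete hidden Markov model can consume.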
We used the 16 complex dynamic gestures (Table 1) to generate codebooks of sizes
Hand posture recognition
Static gesture recognition is often based on template matching-based methods [16,30]. In this work, we adopted the Euclidean distance method to find the similarity between samples and template images. It should be noted that the two N-dimensional feature vectors r and
We performed the normalization procedure using the Gaussian model method. Given an input gesture sample

The classification algorithm
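The core of Euclidean-distance template matching can be sketched in a few lines. The sketch omits the normalization step and the two-handed case, and the template vectors are made-up numbers used only for illustration; only the gesture names come from our gesture set. `classify_posture` is a hypothetical helper name.

```python
from math import dist

def classify_posture(feature, templates):
    """Return (label, distance) of the template nearest to `feature`."""
    label = min(templates, key=lambda name: dist(feature, templates[name]))
    return label, dist(feature, templates[label])

# Hypothetical 4-dimensional template vectors, invented for this example.
templates = {
    "Confirm": (0.10, 0.20, 0.30, 0.40),
    "Mute": (0.40, 0.30, 0.20, 0.10),
    "Menu": (0.25, 0.25, 0.25, 0.25),
}
label, distance = classify_posture((0.12, 0.18, 0.33, 0.37), templates)
```

In practice, a maximum acceptable distance could additionally be used to reject samples that match no template well; that threshold is an assumption and is not shown here.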
For two-handed gestures, the system should compute the global distances
The hidden Markov model (HMM) is a popular statistical modeling method used in single hand trajectory recognition in recent years [11,29]. However, the conventional HMM has less effective modeling performance for two-handed gestures because it makes restrictive assumptions that the system generates a single process having a small number of states and an extremely limited state memory. Therefore, the two-handed input signals fail to satisfy the very restrictive Markov condition. To improve this, many new models were proposed, for example, Factorial HMM (FHMM), Layered HMM (LHMM), and Coupled HMM (CHMM). Because of the strong capability of modeling and classifying two coupled random processes (e.g., Chinese martial art and Tai Chi) [1,2,19], the CHMM is used in this paper for two-handed dynamic gesture modeling and recognition. Figure 3 shows examples of the HMM and CHMM topologies.

a) HMM topology; b) a two-chain CHMM topology.
As shown in Fig. 3, a CHMM has two chains – a left-hand chain and a right-hand chain. Some parameters for defining the CHMM are given as follows:
An observation vector,
A sequence of joint states,
Transition probability
Observation probability
Based on the above statement, the likelihood function of a CHMM is defined as follows:
Then, the CHMM parameters are estimated using the maximum a posteriori (MAP) method. In the recognition phase, the Viterbi algorithm [35] is used to decode all gesture CHMMs. Given an unknown gesture input, our recognizer identifies the gesture CHMM that produces the maximum probability of the observation symbol sequence and assigns the index of that model to the unknown gesture as its class label. To find the best state sequence for a given observation vector O, we need to define the quantity:
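As an illustration of Viterbi decoding, the sketch below implements the textbook algorithm for a single-chain discrete HMM in the log domain. It is a simplification: the coupled two-chain model used in this paper has joint states and coupled transitions that are not modeled here. The toy parameters are a widely used two-state teaching example and have no relation to our gesture models.

```python
from math import exp, log

def _log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return log(p) if p > 0 else float("-inf")

def viterbi(obs, start, trans, emit):
    """Best state path and its log-probability for a discrete HMM.

    start[s], trans[r][s], and emit[s][o] are plain probabilities;
    states and observation symbols are integer indices.
    """
    n = len(start)
    delta = [_log(start[s]) + _log(emit[s][obs[0]]) for s in range(n)]
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for s in range(n):
            # Best predecessor state for arriving in state s at this step.
            best = max(range(n), key=lambda r: prev[r] + _log(trans[r][s]))
            ptr.append(best)
            delta.append(prev[best] + _log(trans[best][s]) + _log(emit[s][o]))
        back.append(ptr)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1], delta[last]

# Toy two-state model with three observation symbols (illustrative values).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, score = viterbi([0, 1, 2], start, trans, emit)
```

For classification, the same observation sequence would be decoded against every gesture model, and the label of the model with the highest score would be assigned to the input.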
A robust gesture recognition system should be able to support multiple continuous gestures and avoid the “Midas Touch problem”, which means each user action may be captured by a camera and then identified as a gesture by the system if there is no effective filtering mechanism in place. In this work, we proposed a visual-attention based filter model to reject meaningless actions not defined in the gesture vocabulary. Compared with prior techniques that rely only on low-level velocity and acceleration information [1], the proposed model introduces context information for meaningful gesture recognition.
Suppose G is an intended meaningful gesture, T is the gesture class to which G belongs, and
It is difficult to construct such a filter model, especially in a real TV viewing environment, because there are infinite meaningless actions by the users. In this paper, all meaningless actions are modeled by a unified CHMM where the joint states of all gesture CHMMs are copied with their own self-transition probabilities
Then, the transition probability between two non-dummy joint states
Because these two dummy states (

Gesture spotting framework.
After the construction of the filter model for all one- and two-handed gestures in our user-defined gesture set, a relative entropy method [20] is used to reduce the states in the filter model and improve system performance. Finally, two filter models are embedded into the gesture spotting framework – one filters meaningless actions before a meaningful gesture and the other filters meaningless actions after a gesture (Fig. 4).
Then, we use a sliding window technique to calculate the observation probability of the gesture HMMs
Let
Figure 6 presents an example of the temporal evolution of the filter model and multiple continuous dynamic gestures of “Channel 127” in the process of recognition. In the first 21 frames, the probability of the filter model is the greatest. Dynamic gesture 1 is then detected at frame 21 and has the highest probability thereafter. After 20 frames, the probability of dynamic gesture 1 drops nearly to zero and the probability of the filter model again becomes the highest. Next, the system successfully detects dynamic gestures 2 and 7 at frames 65 and 105, respectively. From Fig. 6, we can see that the filter model provides a confidence measure of whether to accept the input as a gesture or reject it as a meaningless action.
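The accept-or-reject behavior described above can be sketched as a comparison between per-frame gesture scores and the filter model score, which acts as an adaptive threshold. The sketch assumes the per-frame scores have already been computed (for example, over a sliding window); the function name `spot_gestures` and all score values are synthetic and are not measurements from our system.

```python
def spot_gestures(gesture_scores, filter_scores):
    """Return (label, start_frame, end_frame) for each spotted gesture.

    gesture_scores maps a gesture label to its list of per-frame scores;
    filter_scores holds the per-frame scores of the filter model.
    A gesture is accepted only while its score exceeds the filter score.
    """
    segments, current, start = [], None, None
    for t, threshold in enumerate(filter_scores):
        label, score = max(((g, s[t]) for g, s in gesture_scores.items()),
                           key=lambda item: item[1])
        winner = label if score > threshold else None
        if winner != current:
            if current is not None:
                segments.append((current, start, t - 1))  # close the segment
            current, start = winner, t
    if current is not None:
        segments.append((current, start, len(filter_scores) - 1))
    return segments

# Synthetic scores for two digit gestures and the filter model.
filter_scores = [0.9, 0.9, 0.2, 0.2, 0.9, 0.1, 0.1, 0.9]
gesture_scores = {
    "1": [0.1, 0.1, 0.8, 0.7, 0.1, 0.0, 0.0, 0.0],
    "2": [0.0, 0.0, 0.1, 0.1, 0.0, 0.9, 0.8, 0.1],
}
segments = spot_gestures(gesture_scores, filter_scores)
channel = "".join(label for label, _, _ in segments)
```

Concatenating the spotted digit labels in temporal order is one simple way to obtain a multi-digit channel number from a continuous input stream.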

Workflow of the sliding window.

The likelihood evolution of the gesture CHMMs and the filter model for the continuous recognition of gesture Channel 127.
We created a living room environment and then conducted two experiments and a comparative analysis to study the performance of our user-defined freehand gestures in TV-based applications. The purpose of our research was to investigate the following problems:
What is the performance of the proposed gesture recognition system? Is it good enough to identify the user-defined gestures in practice? How does freehand-gesture-based input perform in comparison with the conventional remote control in TV-based applications that require complex navigation and item selection? How will people accept freehand-gesture-based input?
Experiment 1: Performance evaluation of our gesture recognition system
The main purpose of Experiment 1 was to evaluate the recognition accuracy of our gesture recognition system.
Participants and apparatus
In this experiment, we recruited 20 participants, 10 males and 10 females, from a university. Their ages were between 18 and 45 (
We created a living room environment equipped with a 42-inch SONY TV, a depth camera, a remote control, and a PC with a dual-core 2.4 GHz CPU and 4 GB of memory. The TV and the depth camera were connected to the PC, which hosted our system to recognize the users’ gestures and deliver contents to the TV.
Procedure
Each participant performed all 19 gestures (3 static and 16 dynamic), and each gesture was performed ten times by each of the 20 participants. The sample collection procedure lasted approximately 2 to 3 hours for each participant.
Results
In total, 3,800 gesture samples (
The confusion matrix of static gesture recognition
Table 2 shows that the mean recognition accuracy of the three static gestures is 98.5%. The two most confused gestures are Confirm and Menu. This may be due to the similarity of the shape of the two gestures, or the inter-user differences in performing these two gestures.
As described above, each dynamic gesture in this study can be decomposed into an initial hand posture and the following hand motion trajectory. Considering the possible influence that the initial hand posture could exert on the recognition of the subsequent hand motion trajectory, we compared the results of the 16 dynamic gestures with and without initial hand postures. For dynamic gestures without initial hand postures, we only compared their motion trajectories. The experimental results are shown in Table 3.
The confusion matrix of dynamic gesture recognition with initial hand-pose (upper values) and without initial hand-pose (lower values)
Table 3 shows that the mean recognition accuracies of the 16 dynamic gestures without and with initial hand postures are 94.3% and 92.2%, respectively. The lower accuracy for dynamic gestures with an initial hand posture is probably because the hand posture and hand motion trajectory recognition modules are executed concurrently. As a result, errors may arise when an initial hand posture is misidentified or not identified at all.
To further validate the recognition performance, we tested the proposed system on a public gesture data set, the 10-gesture data set by Ren et al. [30]. According to Cheng et al. [4], Ren et al. were the first to collect 3D static freehand gestures with a depth camera (Kinect), and their gesture data set was quite challenging because the gestures were collected in an uncontrolled environment (cluttered backgrounds and lighting conditions) similar to the one in our study. Therefore, the 10-gesture data set was chosen as the benchmark for the static gesture comparison experiment. The experimental results are shown in Table 4.
Static gesture comparison
Dynamic gesture comparison
Overall, our method slightly outperforms Ren et al.’s method for static gesture recognition on the 10-gesture data set. Additionally, Ren et al.’s method requires the user to wear a black belt on his/her wrist when performing gestures. In contrast, we provide a more natural freehand gesture interaction technique for people to use in a real TV viewing environment. From the experimental results, we found that the most frequently confused pairs are gestures 0 and 9, gestures 1 and 8, and gestures 4 and 5. The lower recognition accuracy for these pairs may be because we used a DDF method to describe the hand region in a binary image (Fig. 2d). Compared with other fingers, the thumb is shorter and smaller. As a result, two gestures with and without a thumb may not be distinct enough in a segmented binary image.
For dynamic gestures, we tested recognition accuracy on the following three popular gesture sets: the $1 gesture set [17,28,38], the 26-graffiti gesture set [6,17], and the two-handed gesture set [28]. The $1 gesture set and the 26-graffiti gesture set were used for one-handed gesture recognition tests. In this experiment, we only compared the recognition accuracy of hand motion trajectories among different methods, regardless of the initial static hand postures in the dynamic gestures. The comparison results are provided in Table 5.
As shown in Table 5, our classifier outperforms the classifiers presented in Kristensson et al. [17] and Pedersoli et al. [28] for dynamic gesture recognition on the three public gesture sets. By adopting a conventional HMM method, Pedersoli et al.’s method achieved an average accuracy of 84.2% for isolated one-handed gestures. However, their classifier cannot identify two-handed gestures. In contrast to an HMM-based approach, Kristensson et al. used a probabilistic reasoning algorithm to incrementally predict the user’s meaningful gestures while they are still being articulated. Their system achieved a slightly higher accuracy for two-handed gestures (95.5%) than for one-handed gestures (93.6%) due to the additional information in the probabilistic model. By adopting a CHMM method, our classifier achieved an average accuracy of 96.7% for two-handed gestures. The results show that the CHMM is well suited to two-handed gesture recognition because it can model the interrelationships between the two hands.
Additionally, Kristensson et al.’s and Pedersoli et al.’s methods primarily evaluated isolated gesture recognition. That is, their gesture recognition method assumes that the start and end points of a gesture have already been unambiguously spotted in advance in an input video. In contrast, by using the proposed filter model, our system can reject meaningless gestures effectively at the early stage of gesture recognition and spot meaningful gestures accurately, which in turn achieves an average recognition accuracy of 92.4% and 90.5% for continuous one- and two-handed dynamic gestures without prior purification in a natural TV viewing environment.
To investigate the potential of intuitive freehand gestures as an abstract input device for smart TV systems, we conducted Experiment 2 to compare freehand-gesture-based inputs with a conventional remote-control approach. We are interested in the following issues:
Can gesture-based input work effectively? How do people view gesture-based inputs?
Experimental design
The experiment included three treatments. In Treatment A (TA), participants used a conventional remote control. In Treatment B (TB), participants were asked to use the user-elicited gestures for TV control proposed by Zaiţi et al. [45]. Here, we made a comparison with Zaiţi et al.’s study because a) their research and ours share the same target functions as shown in Table 6, and b) they adopted a similar but slightly different elicitation study method to derive the most popular gesture for each of the target functions. In Treatment C (TC), participants were asked to use the gesture set proposed in our study.
Experimental tasks
We recruited 24 participants from a university campus, including 12 males and 12 females. These participants were between 19 and 51 years old (
We used the same apparatus that was used in Experiment 1.
Procedure
During the experiment, all participants were asked to complete a set of TV tasks including menu navigation and content searching (Table 6) as quickly as possible three times.
Table 6 lists the 11 tasks used in this study. The two left columns of the table identify each task with a sequential number and a task name. The middle column indicates the interactive manipulations by using a remote control. The two right columns indicate the user-defined gestures proposed by Zaiţi et al.’s study and ours for each task. As shown, the user-defined gestures for Tasks 6 through 12 in both studies are the same.
Our experiment was a within-subject design. A Latin square was used to counterbalance different treatment orders. Participants were randomly assigned to these orders.
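The counterbalancing described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the paper does not say which Latin square was used or how many participants received each order, so the cyclic 3×3 square, the equal split of 24 participants across the three orders, and the names `latin_square` and `assign_orders` are all assumptions.

```python
import random


def latin_square(treatments):
    """Build a cyclic Latin square: row i is the treatment list rotated left by i.

    Each treatment appears exactly once in every row and every column.
    """
    n = len(treatments)
    return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]


def assign_orders(participant_ids, treatments, seed=0):
    """Randomly assign participants to the rows (treatment orders) of the square.

    Participants are shuffled with a seeded generator and then dealt to the
    orders in turn, so the orders are used equally often when the number of
    participants is a multiple of the number of treatments (assumed here).
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    orders = latin_square(treatments)
    return {pid: orders[k % len(orders)] for k, pid in enumerate(ids)}


if __name__ == "__main__":
    # Hypothetical participant IDs 1..24 and the three treatments from the text.
    assignment = assign_orders(range(1, 25), ["TA", "TB", "TC"], seed=1)
    for pid in sorted(assignment):
        print(pid, assignment[pid])
```

With three treatments, a cyclic square balances the position of each treatment but not every first-order carryover sequence; balancing those as well would require all six permutations of the treatments.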
Before the study, participants were allowed to practice until they successfully completed a task similar to the one in our study. We collected data on user performance and user satisfaction with provided interaction tools. The performance was measured by task completion time, i.e., the time interval between the moment a task started and the moment a participant correctly finished the task. After completing all tasks, each participant was asked to complete a questionnaire concerning their opinions of the three interaction techniques. Participants used a Likert scale (1 worst, 7 best) to evaluate the three techniques in terms of ease of use (i.e., the degree to which a technique can be used to perform a task with effectiveness), ease of learning (i.e., how easy it is to learn how to use a technique), efficiency (i.e., the degree to which a technique can perform a task correctly), intuitiveness (i.e., the degree to which participants can use a technique directly by intuition without rational thought), physical fatigue (i.e., how difficult it is to use a technique physically), and enjoyableness (i.e., the degree to which participants like to use a technique to perform a task).
Results
Figure 7 compares the task completion time for the three treatments. The overall time for TB and TC to complete the given tasks is less than that for TA. As shown in Fig. 7, the average task completion times in TA, TB, and TC are 76.46 s (

Comparison of task completion times.
Figure 8 shows the qualitative evaluation results. As shown, both of the freehand-gesture-based interaction techniques (TB and TC) were perceived as better than the remote control (TA) in terms of ease of use, intuitiveness, and enjoyableness. In comparison, TA was perceived as easier to learn and more comfortable to use than TB and TC. No significant difference was found between TA and TC in terms of efficiency (

Ranking of the three techniques in terms of ease of use, ease of learning, efficiency, intuitiveness, physical fatigue, and enjoyableness.
The experimental results show that both TB and TC allow users to interact with the TV easily and intuitively, and both perform significantly better than TA. Our results indicate that TB and TC can also offer improved user experience and user satisfaction from several perspectives.
Examining the differences among the three techniques more closely, we found that participants needed significantly more time to perform content searching, e.g., searching for a specific movie, with TA than with TB or TC. This may be because with a remote control, the user has to press the arrow keys “←”, “→”, “↑”, and “↓” to select digits or characters from the left panel of the search box (Fig. 9). Thus, the user often must shift their attention between the television screen and the remote control. In comparison, when drawing Arabic numerals in the air with dynamic gestures, participants could keep their eyes on the TV screen and avoid the eye-hand separation problem in TA. Therefore, those gestures are more direct and efficient in performing the given tasks.

The user is searching for a movie named “2012”.
The subjective ratings in our study show that TA is perceived as the easiest technique to learn because it does not require the user to remember the gesture semantics defined in TB and TC.
In contrast, TB and TC are perceived as more enjoyable than TA. This may be because TB and TC were designed based on the results of the participatory study of people’s intuitive gestures when interacting with a TV. The user-defined gestures supported by TB and TC involve no complex or strange movements. The simplicity and ease of the gestures, as well as their natural interaction, may contribute to the perceived benefits.
In general, participants showed more positive attitudes toward the user-defined gestures proposed in our study than toward those proposed in Zaiţi et al.’s study. For some tasks, such as Turn on the TV, Turn off the TV, and Mute, participants preferred the two-handed gestures designed in our study over the one-handed gestures proposed by Zaiţi et al. One participant stated:
I would like to use the two-handed gesture, both hands moving from the center middle to the outer left and right for Task 1 – Turn on the TV, because it looks like opening a door to a TV show and makes me feel more involved and immersive.
Some participants said that the two-handed gestures could also be used to perform more complex tasks such as Image Scaling and Image Rotating. Previous research has provided empirical evidence on the benefits of two-handed gestures in user interfaces [15]. All these findings indicate that two-handed gesture input is a promising technique to improve the directness and degree of manipulation afforded by the user interface.
In addition, participants noted that the two gestures proposed by Zaiţi et al. for Task 2 (Turn off the TV) and Task 5 (Mute), Close palm and Close fingers into pinch, often caused confusion because they are very similar and difficult to distinguish. Sometimes, participants only wanted to mute the volume but accidentally turned off the TV.
Although TB and TC achieve good scores from several perspectives, some important problems still exist, such as comfort in interaction. It is not surprising to see that participants perceived TA as a more comfortable means than TB and TC. Similar findings were reported by Cabral et al. [3]. A mouse or a remote control can be easily grasped in the palm, with all tasks performed using an index finger or a thumb. In contrast, more muscle groups and longer movements are required in gesture-based interaction. The lengthy interaction may lead to body fatigue. Although in general our participants rated gesture-based input very high, showing the same enthusiasm for new technologies that has been seen in previous research [3,7,15,33], we believe that these problems should be a concern and more efforts are needed to design more comfortable gesture-based commands.
Controls for gesture-based interactive TV systems must be accurate, intuitive, physically appropriate, and user-friendly. Although considerable work on gesture-based TV applications has been reported in recent years, much of it falls into the category of “a solution looking for a problem”. In this paper, we created a living room environment and adopted a top-down paradigm to explore these qualities in a study of using natural freehand gestures to interact with a smart TV system.
The main contribution of this study lies in the unified framework we proposed for gesture-based applications. This framework has two advantages: 1) We derived a set of 19 user-defined gestures for TV control tasks by using an improved elicitation study method. Compared with many gestural systems designed by experts, this study provides valuable insights into the real needs and expectations of TV viewers and helps designers to learn how end-users would complete a task or utilize a gesture-based system and what types of gestures should be developed in a gesture-based TV system. As a result, the user-defined gesture vocabulary can reduce the risk that a system delivered to end users may be unusable and ineffective. 2) We proposed a unified framework toward the recognition of the user-defined gesture set for smart TV systems. Compared with many existing algorithms developed in a lab environment, this framework focuses on addressing some specific problems that exist in a complex real-world TV viewing environment, including the automatic exclusion of many meaningless actions from TV viewers in their daily life, a smooth switching mechanism between static and dynamic gesture recognition as well as between one- and two-handed gesture identification, and the continuous recognition of multiple dynamic gestures in the air (e.g., a channel switching gesture for channel 127).
To validate the proposed method, we developed a gesture-based TV control system and conducted two experiments and a comparative analysis. Experimental results show that the proposed method can achieve satisfactory recognition accuracy for both static and dynamic gestures, as well as for one- and two-handed gestures. Even in a real TV viewing environment, the gesture-based interaction design allows users to interact with a smart TV system effectively. Participants were also enthusiastic about this new technique.
A limitation of this study is that we did not consider people with limited mobility or a physical disability. Most participants in our study were students and professors from a university aged between 18 and 51. We chose these participants for two reasons: our limited access to subject pools beyond this age group within a university, and our expectation that these subjects could accept and learn new HCI techniques such as freehand-gesture-based interaction with little difficulty. In practice, ordinary users can obtain recognition accuracy similar to that shown in Tables 2 and 3 by following the training process illustrated in Section 5.1.
An interesting next step is to study the influence of cultural factors on the design and choices of gestures. Even within the same application domain, different researchers with different cultural backgrounds may obtain different results by running independent elicitation studies. For example, for the same Turn off the TV command, the elicited gestures in Kühnel et al. [18], Vatavu et al. [33], and our study are significantly different. However, for four other commands – Volume Up, Volume Down, Previous Channel and Next Channel – the elicited gestures in these three studies are strikingly consistent. Such cultural differences may lead to more challenges in technical design (e.g., the recognition algorithms). Another interesting future step is to explore a reasonable integration mechanism for freehand-gesture and voice-based multimodal interaction for those with disabilities in TV viewing environments.
Acknowledgements
The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61772564, 61772468, 61202344 and the funding offered by the China Scholarship Council (CSC).
