Abstract
A gesture elicitation study is a popular method in which a sample of end users is asked to propose gestures for executing functions in a certain context of use, specified by its users and their functions, the device or platform used, and the physical environment in which they are working. Gestures proposed in such a study need to be classified and, perhaps, extended in order to feed a gesture recognizer. To support this process, we conducted a full-body gesture elicitation study for executing functions in a smart home environment by domestic end users in front of a camera. Instead of defining functions opportunistically, we define them based on a taxonomy of abstract tasks. From these elicited gestures, an XML-compliant grammar for specifying the resulting gestures is defined, created, and implemented to graphically represent, label, characterize, and formally present such full-body gestures. The formal notation for specifying such gestures is also useful for generating variations of elicited gestures on the fly in order to allow one-shot learning.
Introduction
Natural Language Processing (NLP) [1] is often defined as the science of making computers understand information in the same way as a person does. Human communication may be analyzed, using a basic taxonomy, according to whether messages are transmitted through verbal or nonverbal mechanisms. Although multiple investigations mention that approximately 65 percent of human communication is carried out through nonverbal mechanisms [2], most research in the literature focuses on analyzing speech. Emphasis on Gesture Recognition (GR), considering facial expressions and body motion, has grown only in the last two decades as a consequence of the development of Machine Learning (ML) algorithms, data processing techniques, and hardware devices.
GR for interaction purposes has been addressed during the last decades following different approaches and through its application to a number of domains such as education, sports training, rehabilitation, and robotics [3–6]. Current proposals report accuracy levels close to 100 percent using machine learning techniques and traditional training procedures in which a large number of samples are gathered, processed, and labeled. However, dynamically adapting gestural interfaces to a variety of contexts of use (each defined as a user and related tasks, a platform, and an environment [26]) demands an approach in which users can enrich the training of the recognition system by providing a gesture only once; that is, training must be performed using a single sample.
We introduced a methodology [7] for GR relying on user-defined gesture sets and a One-Shot Learning (OSL) approach [8]. However, even though the stages for achieving GR and for characterizing the obtained gestures are described there, a formal notation for describing such gestures is lacking, which makes it impossible to process the gathered information computationally and then use it to train a classifier applying OSL. Often, gesture recognizers work on gesture templates that are defined in their own physical format, such as points in space or vectors [20].
In this paper, we report on how a gesture elicitation study was conducted on full-body gestures for executing abstract tasks, from which a grammar was defined and represented through informal and formal means, making it possible to feed this information to a classifier and to carry out OSL. Abstract tasks, observed through their instantiation as commands on a streaming platform, are used so that the obtained grammar is useful across different interaction domains. We also describe the rules to be followed for training under the OSL approach.
The rest of the paper is organized as follows. Section 2 contains a brief description of the state of the art, including work on full-body GR and an introduction to abstract interactive tasks. The setup of the experiment and its execution are described in Section 3. Section 4 presents the gathered results, showing the obtained grammar and explaining the notation used. Section 5 discusses the insights and contributions of the paper. Finally, Section 6 presents the conclusions and the future work to be addressed.
State of the art
This section provides a brief state of the art on full-body GR, including the definition and description of abstract interaction tasks.
Gesture elicitation study
To illustrate the purpose of a Gesture Elicitation Study (GES) [23], let us consider a user interface designer willing to design a full-body gesture user interface that enables end users to control a smart home environment with commands such as turn lights on, turn lights off, dim lights, and make lights brighter. A sample P of the target population, typically of at least 30 participants, i.e., |P| ≥ 30, is defined to participate in the GES. Each participant is presented with the effect of each function (e.g., the intensity of the lights is increased for the “make lights brighter” function) and is asked to propose a full-body gesture, for example captured by a camera, that would generate the desired effect. When the study is completed for all functions and participants, the designer has collected a resulting set of 120 elicited gestures = 4 (functions) × 30 (individual gestures). In order to decide which gesture to keep for each function, the designer browses the resulting set to determine whether participants agree on a gesture for each function. This agreement can be assessed by computing various formulas [25] or by frequency. If the consensus conditions are fulfilled, the designer can be confident that the elicited gestures are intuitive, and that other end users would likely guess, learn, and perhaps prefer the same gestures as those elicited in the study.
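The agreement computation mentioned above can be sketched as follows. The text does not state which formula is applied, so this sketch assumes the agreement rate of Vatavu and Wobbrock, AR = |P|/(|P|-1) · Σ(|Pi|/|P|)² - 1/(|P|-1), where each Pi is a group of identical proposals for one function. The function names and the gesture labels in the usage note are hypothetical and are not taken from the study.

```python
from collections import Counter


def agreement_rate(proposals):
    """Agreement rate for one function (referent).

    proposals: list with one gesture label per participant.
    Assumed formula (not confirmed by the text):
    AR = |P|/(|P|-1) * sum((|Pi|/|P|)**2) - 1/(|P|-1)
    """
    n = len(proposals)
    if n < 2:
        raise ValueError("need at least two proposals")
    groups = Counter(proposals)  # sizes of groups of identical proposals
    squared = sum((size / n) ** 2 for size in groups.values())
    return n / (n - 1) * squared - 1 / (n - 1)


def most_frequent(proposals):
    """Frequency-based alternative: the most proposed label and its count."""
    return Counter(proposals).most_common(1)[0]
```

For instance, calling agreement_rate on 30 hypothetical proposals made of 18 "raise arm", 9 "clap", and 3 "nod" labels gives the agreement for that function, while most_frequent returns the label that would be kept under a purely frequency-based decision.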
Full-body gesture recognition
Gestures are expressive, meaningful body motions, static or dynamic, involving physical movements of the fingers, hands, arms, legs, head, or face. Through gestures, it is possible to communicate in a nonverbal way and to enrich verbal communication as well [9]. A systematic literature review including 60 works on GR was performed following a methodology [7]. Among the relevant insights, 27% of the works focus on gestures of the hands using Leap Motion sensors, depth cameras, wearable devices, and RGB cameras; 7% describe interaction by means of the lower limbs, using wearable devices and depth cameras; and only 10% consider full-body interaction, most of them using depth cameras as data acquisition devices. A closer review of these works showed that researchers on GR focus on the most expressive body parts even when they use technology that enables the recognition of full-body gestures. This is understandable, as users are more likely to use the most expressive body parts, i.e., hands and facial expressions, to interact with systems. However, there are contexts in which users need to provide gestures using less expressive body parts; consider, for instance, a rehabilitation system focusing on knee movement recovery. One of the works regarding full-body interaction is [10], in which users were asked to perform gestures for controlling a humanoid robot. In that study, Obaid et al. proposed navigational commands and analyzed the gestures provided by users, reporting agreement scores, time performances, and graphical representations of the consensus set. Along with the elicitation of the gesture set, significant insights were discussed, such as the need to define points of view when trying to recognize gestures for navigation: if users’ motion is tracked using cameras in front of them, the direction of the gestures may be misinterpreted.
It is worth pointing out that most of the proposals in the literature addressing elicitation studies aim to provide the basis for tailored solutions, as they are performed on specific tasks within particular applications. As this work aims to generalize the gesture grammar to be obtained, abstract tasks are considered. The following subsection describes them.
Abstract tasks
Abstract tasks are defined as standard actions [11] that together list the interactive tasks users may perform in an application. Keeping such tasks, presented in Table 1, in mind while designing user interfaces enables the definition of interaction mechanisms for their instantiations that are specific to the task and context of use.
Standard tasks and use examples from Constantine [11]
The original intention behind the definition of abstract tasks was to provide a notation for designing graphical user interfaces. For the purposes of this proposal, however, they are considered when preparing the experiment, making it possible to define a gesture grammar associated with abstract tasks and thus enabling its use in different contexts. The following section describes the experiment that was performed in order to obtain a gesture grammar associated with abstract interaction tasks.
In order to identify a gesture grammar for interaction, we decided to perform a gesture elicitation study, as other authors in the literature have done [12–14]. A first approach consisted of asking users to provide gestures for abstract tasks such as those described in Section 2.3. However, we noticed that users needed specific examples of how the tasks correspond to particular applications in order to better understand what was required. This led us to use the Netflix streaming platform as a basis while keeping track of the relation between the analyzed tasks and the abstract tasks. Developing an interaction mechanism for gesture-based applications is a challenging, time-consuming activity. This, added to the objective of identifying gestures rather than recognizing them, made it evident that implementing a gesture recognition system was out of the scope of this work and that, in its place, a technique for gathering information about the interaction was needed.
The Wizard of Oz technique has been used throughout the history of the development of interactive systems [15, 21] and, in particular, in the development of natural interfaces, as it is a way to collect data for mixed reality environments or movement commands for interaction in several application domains [16]. Following this technique, the designed experiment consisted of asking users to use their own body gestures to interact with the Netflix streaming platform using only specified body parts, giving them the feeling that those commands worked on the application while the input was actually provided to the system through keyboard and mouse commands.
The experiment, whose setup is shown in Fig. 1, took place in a Gesell room and included two computers (PC1 and PC2): the first retrieved users’ movement information from a webcam and was used to document the experiment, while the second was connected to a projector to provide the user with feedback and to simulate the user’s interaction with the system. An observer was located next to the user in order to provide support in case any doubt arose. The experiment controller managed the devices to document the experiment and to explain how each user performed the activity.

Experiment setup following the Wizard of Oz technique [16] and its disposition in a Gesell chamber.
During the experiment, 90 users divided into 6 groups were asked to provide the gestures they would use to perform 23 commands within the Netflix streaming platform. The 23 tasks are presented in Table 2 along with their associated abstract tasks. For each gesture that users carried out for a requested command, the gathered data included the response time, a video recording of the gesture, and a score given by the same user for how natural they thought the gesture was for the task. In total, 2070 records were acquired and then processed through manual labeling and manual comparison, and finally assigned an agreement score.
Defined tasks for the gesture elicitation study and their association to abstract interactive tasks from [11]
The gestures with the highest agreement scores for each of the defined tasks were included in the grammar. They were graphically represented for quick identification, textually described, and characterized in terms of an anthropomorphic notation [7], and later expressed in a formal notation that allows them to be processed and included in a classifier. The results of the experiment and the proposed gesture grammar are described in the next section.
Once the experiment was prepared and the 90 participants were recruited, they were divided into 6 groups of 15 users each, balanced in terms of gender, age, and handedness. Each group followed the same protocol introduced in the previous section, with the particularity of having to interact with the system using a different part of the body. Group 1 was instructed to use only facial expressions, group 2 only movement of the neck and shoulders, group 3 only their arms, group 4 only their hands, group 5 only their legs, and group 6 only their feet. The results of the experiment are addressed in the following subsections.
Obtained grammar for interaction with a streaming platform and its correspondence with abstract tasks
In total, 124 different gestures were identified during the elicitation study. These gestures, and combinations of them, were used by the participants to express all of the commands within the streaming platform. Regarding facial expressions (Table 3), 16 gestures were identified, grouped according to the tasks in which they were used, labeled, described, and characterized.
Facial expressions obtained in the gesture elicitation study
For gestures made using only the neck and shoulders, we selected 21 of the samples that users provided, listed in Table 4. As with the facial expressions, this set of gestures was also labeled, grouped, described, and characterized.
Gestures from neck and shoulders obtained in the gesture elicitation study
After analyzing the gestures that subjects provided using their arms, considering only motion of the shoulders, elbows, and wrists, we selected and decomposed them into 22 different gestures, as presented in Table 5.
Gestures from arms obtained in the gesture elicitation study
In the same way, the 23 gestures from the hands included in Table 6 were gathered during the experiment.
Gestures from hands obtained in the elicitation study
After completing the analysis of the upper body, the study continued with the analysis of the lower limbs. Table 7 presents the 27 leg gestures, involving hips, knees, and ankles, that subjects agreed on for interacting with the streaming platform.
Gestures from legs obtained in the gesture elicitation study
Finally, in order to complete the study, gestures made using the feet were reviewed. Table 8 summarizes the information gathered for the 15 selected gestures for interacting using feet.
Gestures from feet obtained in the elicitation study
A gesture grammar not only allows the identification of gestures for interaction, but also enables user training through the use of graphical representations. Fig. 2 reproduces an excerpt of the non-formal representation of the gesture grammar corresponding to arm and leg gestures. The informal representation of the gesture grammar, obtained from the data gathered in the elicitation study, is useful for proposing an interaction mechanism and for user interface engineering. However, since our proposal aims at automating the training of gesture recognition and classification with a focus on OSL, a formal representation for storing and processing the incoming data is required. The next subsection describes the proposed notation for this purpose.

Excerpt of the non-formal graphical representation of the identified gestures from arms and legs.
Enriching the proposal in [7] with a notation for formally describing gestures, and hence enabling the storage and processing of incoming data to be fed to a classification or recognition stage, in this paper we define an XML file whose structure complies with anthropomorphic metrics and contains the possible motions and body parts according to the capability model [17]. As elements related to facial expression, the XML structure considers the point of fixation, vertical rotation, horizontal rotation, and eyelid distance of both eyes, which is particularly useful for expressing head and shoulders gestures [24]; the brow head position, brow arch position, and brow tail position of both eyebrows; the nasal tip position, nasal ridge wrinkling, left nostril flaring, and right nostril flaring of the nose; the lip distance and position of the mouth; the tip position, curling, turning, and folding of the tongue; and the inflation and contraction of both cheeks. For neck gestures, the structure allows recording flexion, extension, right lateral rotation, left lateral rotation, right lateral flexion, and left lateral flexion. Similarly, for the upper limbs, the proposed structure covers flexion, extension, abduction, adduction, external rotation, and internal rotation of both shoulders; flexion, extension, and hyperextension of both elbows; supination and pronation of both forearms; flexion, extension, radial deviation, and ulnar deviation of both wrists; supination and pronation of both hands; and flexion, extension, adduction, abduction, and hyperextension of the finger joints.
Finally, regarding the lower limbs, the proposed structure allows entering flexion, extension, adduction, abduction, external rotation, and internal rotation of the hips; flexion and extension of both knees; dorsiflexion, plantar flexion, eversion, inversion, pronation, supination, lateral rotation, and medial rotation of both ankles; and flexion and extension of each of the five toes of both feet. With a structure for the formal representation of gestures, it is possible to record and process data coming from a source, manually or automatically, for different purposes, such as automated analysis of guidelines [22]. Fig. 3 instantiates the A006 gesture in the proposed XML structure. The information encoded in the XML file may be expressed in degrees, centimeters, or relative metrics. A discussion of the use of the proposal and of additional results gathered during the experiment is presented in the following section.

Formal representation of the A006 gesture along with its graphical representation.
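To make the idea of such a structure concrete, the following minimal sketch builds a small gesture description with Python's standard xml.etree.ElementTree module. It is not the structure shown in Fig. 3: the element names (gesture, upperLimbs, rightShoulder, and so on), the unit attribute, the identifier EXAMPLE, and all values are hypothetical placeholders chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def build_gesture(gesture_id, unit, blocks):
    """Build an XML tree from nested dictionaries.

    blocks: {block name: {body part: {motion: value}}}
    All tag names are hypothetical; the actual schema is not reproduced here.
    """
    root = ET.Element("gesture", {"id": gesture_id, "unit": unit})
    for block_name, parts in blocks.items():
        block = ET.SubElement(root, block_name)
        for part_name, motions in parts.items():
            part = ET.SubElement(block, part_name)
            for motion, value in motions.items():
                ET.SubElement(part, motion).text = str(value)
    return root


# Illustrative static pose: right arm raised forward (made-up values).
example = build_gesture(
    "EXAMPLE",
    "degrees",
    {
        "upperLimbs": {
            "rightShoulder": {"flexion": 90, "abduction": 0},
            "rightElbow": {"flexion": 0},
        }
    },
)
xml_text = ET.tostring(example, encoding="unicode")
```

Nesting blocks, body parts, and motions in this way mirrors the grouping described in the text (face, neck, upper limbs, lower limbs), and a unit attribute is one possible way to record whether values are degrees, centimeters, or relative metrics.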
The grammar presented in the previous section may be used for engineering gesture-based user interfaces. The association between the abstract tasks defined in [11] and specific commands within the analyzed domain, as well as the identification of gestures that users relate to those commands, allows generalization of the obtained gesture grammar. However, it is important to point out that the nature of gestural user interfaces makes it necessary to consider cognitive, cultural, and linguistic aspects for customization purposes [18]. The population that participated in the experiment consisted of 47 female and 43 male subjects, all of them Mexican, aged between 16 and 57 years (M = 38.19, SD = 11.88).
During the experiment, it was possible to confirm that there are many-to-one mappings from concepts to gestures and vice versa, as stated in the literature [9]. To observe this, notice that for the 23 studied tasks, users provided a total of 2070 gestures (345 gestures for each of the six groups). From those, 124 gestures were selected according to their agreement scores and after disambiguation and division of complex gestures into simple, atomic ones. Besides, as may be seen in Tables 3–8, some of the gestures are used for accomplishing several tasks while others are related to a single task. Just after providing each gesture, users were asked to give a score between 1 and 10 according to how difficult, memorable, and descriptive they thought the given gesture was for representing the required command. We expected to find relationships between the response times and users’ scores for each gesture on the one hand and the agreement scores or final assessment on the other; however, as similarly reported in [7], there does not seem to be any.
As part of the experimental protocol, users were asked to fill in the IBM CSUQ [19] in order to gather data on their appreciation of the system in terms of usability. The IBM CSUQ instrument is composed of 19 questions, each answered on a 7-point Likert scale, concerning system use (Q1-Q8), information quality (Q9-Q15), UI quality (Q16-Q18), and general feedback on the system (Q19).
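As an illustration of how answers to such a questionnaire could be summarized, the sketch below averages one participant's ratings per group of questions, using the question ranges listed above. Averaging per group and overall is an assumption made for this example; the text does not describe the exact scoring procedure, and the names CSUQ_SUBSCALES and csuq_scores are hypothetical.

```python
from statistics import mean

# Question ranges as listed in the text; the subscale names are hypothetical.
CSUQ_SUBSCALES = {
    "system_use": range(1, 9),            # Q1-Q8
    "information_quality": range(9, 16),  # Q9-Q15
    "ui_quality": range(16, 19),          # Q16-Q18
    "general": range(19, 20),             # Q19
}


def csuq_scores(answers):
    """Average one participant's ratings per subscale and overall.

    answers: dict mapping question number (1..19) to a rating from 1 to 7.
    """
    if set(answers) != set(range(1, 20)):
        raise ValueError("expected answers to questions 1..19")
    if any(not 1 <= rating <= 7 for rating in answers.values()):
        raise ValueError("ratings must lie on the 7-point scale")
    scores = {
        name: mean(answers[q] for q in questions)
        for name, questions in CSUQ_SUBSCALES.items()
    }
    scores["overall"] = mean(answers.values())
    return scores
```

Group-level figures, such as a mean and standard deviation per body-part group, could then be obtained by aggregating the overall values of the participants in each group.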
The results of the IBM CSUQ showed that interaction using the lower limbs was less satisfying for users, who assessed it with an average of 5.13 out of 7 points (SD = 1.17), whereas after simulating interaction using the upper limbs, subjects gave an average score of 5.8 out of 7 (SD = 0.79). This difference may be caused by the lack of expressiveness of the lower limbs, but also by the fact that subjects are used to using their hands rather than their legs or feet to interact with applications. Along with the quantitative information that was gathered, i.e., response time, gesture score, agreement score, and usability evaluation, it was possible to obtain qualitative insights regarding the experiment. During the experiment, participants belonging to the neck and shoulders, legs, and feet groups complained about how difficult it was for them to use only those parts of their bodies to interact with an application and to achieve complex tasks. It was necessary to introduce them to a scenario in which interacting with limited parts of the body was not only desired but critical, e.g., due to disability, for rehabilitation purposes, or due to characteristics of the context.
The resulting gesture grammar, with its non-formal and formal notations, can already be used for user training, for the design of gesture-based applications, and for starting the training of an automatic gesture classifier. Nevertheless, some clarifications need to be made. The XML file structure was created in compliance with the user capability model from [17], but it can be extended with supplementary features of interest by adding tags to its body. The structure can also be simplified by removing tags. This reduction may apply not only to complete blocks of information (researchers may remove facial expressions from the file if they are not interested in tracking that type of interaction), but also to specific features. Another consideration is that the proposed structure includes opposite pieces of information, e.g., flexion-extension, inflation-contraction, abduction-adduction. These opposite tags need not both be present in a final instantiation, as they carry redundant information and their use depends on the approach followed.
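The simplification by tag removal described above can be sketched with Python's standard xml.etree.ElementTree module. The function name prune and the tag names used in the usage note (face, neck, flexion, extension) are hypothetical; the sketch only shows that whole blocks and individual features can be dropped from a parsed gesture file.

```python
import xml.etree.ElementTree as ET


def prune(root, drop_blocks=(), drop_motions=()):
    """Remove whole top-level blocks and individual motion tags in place.

    drop_blocks: tag names of top-level blocks to delete (e.g. a face block).
    drop_motions: tag names to delete wherever they occur (e.g. one tag of
    an opposite pair when only the other is tracked).
    """
    for block in list(root):
        if block.tag in drop_blocks:
            root.remove(block)
    # Snapshot the remaining elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in drop_motions:
                parent.remove(child)
    return root
```

For example, parsing a file with ET.fromstring and calling prune(root, drop_blocks={"face"}, drop_motions={"extension"}) would leave only the non-facial blocks, keeping flexion as the single tag of each flexion-extension pair.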
Static gestures may be represented with one single XML file, while dynamic gestures require grouping a series of files corresponding to the different times at which significant changes occur. Additional tags should be added to the basic structure in order to record the order of appearance and the time or speed.
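One possible way to group a series of static descriptions into a dynamic gesture is sketched below. The wrapper element dynamicGesture, the keyframe element, and the order and timeMs attributes are hypothetical names introduced for this example; the text only states that order and time information would have to be added.

```python
import xml.etree.ElementTree as ET


def group_keyframes(gesture_id, frames):
    """Wrap static poses into one dynamic gesture element.

    frames: list of (time in milliseconds, pose element) pairs in any order.
    Poses are sorted by time and numbered from 1.
    """
    root = ET.Element("dynamicGesture", {"id": gesture_id})
    ordered = sorted(frames, key=lambda frame: frame[0])
    for order, (time_ms, pose) in enumerate(ordered, start=1):
        keyframe = ET.SubElement(
            root, "keyframe", {"order": str(order), "timeMs": str(time_ms)}
        )
        keyframe.append(pose)
    return root
```

With time stamps stored per keyframe, the speed of a movement could be derived from the change in a value between two consecutive keyframes divided by the elapsed time.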
With respect to the general objective of supporting OSL mentioned in the introduction, the formal notation makes a gesture understandable by computers; however, it is not yet possible to complete a classifier training process with one single sample. Thus, we identified the need to define a mechanism for using the data gathered for one gesture to train an automatic gesture classifier. For this purpose, we propose to apply slight variations to the formally represented information in order to simulate multiple samples for carrying out the training process.
Defining such transformations requires knowledge of motion restrictions, e.g., the maximum rotation of a joint in degrees or the maximum distance between joints, as well as of the tolerance levels within which a gesture may be varied without generating ambiguity. In the following section, conclusions and the future work to be addressed are introduced.
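A minimal sketch of how such variations might be generated from one formally described sample is given below. It assumes that a gesture is reduced to a flat mapping from feature names to numeric values, that each feature has a known range of motion, and that a per-feature tolerance is available; uniform random perturbation is only one possible transformation, since the complete set of transformations is left as future work. The function name make_variations and all names and values in the usage note are hypothetical.

```python
import random


def make_variations(sample, limits, tolerance, count, seed=None):
    """Simulate extra training samples from a single recorded gesture.

    sample: {feature: value} for the one recorded gesture.
    limits: {feature: (minimum, maximum)} motion restrictions.
    tolerance: {feature: largest allowed deviation from the recorded value}.
    Each variation perturbs every feature uniformly within its tolerance and
    clamps the result to the motion restrictions.
    """
    rng = random.Random(seed)
    variations = []
    for _ in range(count):
        varied = {}
        for name, value in sample.items():
            delta = rng.uniform(-tolerance[name], tolerance[name])
            low, high = limits[name]
            varied[name] = min(max(value + delta, low), high)
        variations.append(varied)
    return variations
```

For instance, a sample such as {"shoulder_flexion": 90.0} with placeholder limits (0.0, 180.0) and a tolerance of 10.0 would yield values between 80 and 100 degrees. Choosing tolerances small enough that variations of different gestures do not overlap corresponds to the ambiguity constraint stated above.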
Conclusion and future work
In this paper, we report the results of a full-body gesture elicitation study performed by analyzing tasks within the Netflix streaming platform and aligned with abstract interactive tasks. The results were presented first through a non-formal representation using a textual description and a motion characterization, and then through a formal notation by means of a flexible XML structure containing tags for recording information that describes the characteristics of the gestures, depending on the approach followed. To use this formal notation, it is necessary to assume the existence of a base position relative to which all of the recorded movements are expressed.
This work also presents quantitative and qualitative insights in terms of usability, number of gestures gathered, agreement scores, scores provided for each gesture, and subjects’ considerations and limitations. To the best of our knowledge, there are no other proposals that consider abstract tasks during elicitation studies or define transformations on single samples for addressing OSL. To compare the obtained gestures with other proposals, it will be necessary to analyze such related works considering the tasks that were studied and associating them with abstract tasks. This, along with the definition of a complete set of transformations and variations for addressing OSL, is identified as future work.
Acknowledgments
Jean Vanderdonckt acknowledges the support of the project EDBS Emerald Casting Assistant funded under Convention No. 7901 of MecaTech Competitivity Pole, Walloon Region, Belgium.
