Abstract
Through emerging technologies, it is possible to use hand gestures to interact with computing systems in the form of embodied human-computer interaction (eHCI). Much research has focused on improving gesture recognition accuracy and on understanding the factors that influence intuitive gesture choice; however, there is a lack of work investigating how to design the interface for 3D gestural interactions. Therefore, a between-subjects experimental study was conducted to examine the effect of interface design (e.g., 3D vs. 2D) on intuitive gesture choice and cognitive load when performing an embodied interaction. Two out of ten functions had the same intuitive gesture-function mapping in the 2D and 3D conditions. However, many of the functions had different mappings between the two display types. The results illustrate the differences in embodied interactions between 2D and 3D interfaces, and future work should investigate the interface design comprehensively.
Introduction
Communication, both verbal and nonverbal, is integrated into everyday life. Human-human communication surpasses physical barriers, language differences, and life experiences. Modern day communication can be achieved across great distances regardless of native language. Communication is not limited to words, but can be conveyed through facial expressions, tone, and nonverbal cues that use our bodies to interact, also known as embodied interactions (Spiel, 2021). As emerging technologies are developed, the way humans communicate and interact with computers adapts, which introduces the opportunity for embodied interactions with computer systems.
Through emerging technologies, it is possible to integrate embodiment into interacting with computers in the form of embodied human-computer interaction (eHCI). Embodied interactions have shown promise in several domains (Benford et al., 2021; Dworkin et al., 2018; Lee-Cultura et al., 2020; Sarma & Bhuyan, 2021; Spiel, 2021). Technologies such as augmented and virtual reality implement an eHCI style using virtual reality (VR) headsets or data gloves where people can now use their bodies to interact with computers, but these technologies are invasive to the end user as they are typically bulky equipment. However, there are several emerging technologies that are non-invasive to the end user such as 3D gestural camera systems.
3D gestural technology uses a camera system to detect motion in the hands, fingers, and body. The gestural recognition software typically involves three functions: gesture classification and processing, feature extraction, and response (Dworkin et al., 2018). There is a body of research investigating hand gestures to interact with computers, and it has shown positive results in anesthesia (Jurewicz et al., 2018; Jurewicz & Neyens, 2017), surgery (Jacob & Wachs, 2014), and general HCI tasks (Aigner et al., 2012). Much of the research on eHCI in the form of 3D gestural systems has focused on the accuracy of the gestural recognition algorithms (Sarma & Bhuyan, 2021; Dworkin et al., 2018; Panwar & Mehra, 2011). There has been some work investigating the factors that influence embodied interactions (e.g., intuitive gesture choice), and it has been shown that context (Aigner et al., 2012), domain expertise (Jurewicz & Neyens, 2018), and stimulus-response compatibility (Janczyk et al., 2019) influence one’s embodied interaction in a 3D gestural system.
Although there has been work investigating the factors that influence embodied interactions and improving recognition accuracy, there is a lack of work investigating interface design for 3D gestural HCI. The embodied interaction is a 3D movement; however, it is unclear how the interface should be presented, such as 3D or 2D interfaces. It has been shown that 3D interfaces are preferred and are more accurate compared to 2D interfaces for rotation of objects but not for position accuracy (Bueckle et al., 2021). 3D displays are favored for shape understanding, but 2D displays are preferred by users for relative position tasks (St. John et al., 2001). There is not a consensus among researchers on 2D vs. 3D interfaces for HCI functions, and furthermore, much of the research has focused on the traditional keyboard and mouse as the computer input.
There have not been any studies comparing 2D and 3D interfaces for 3D gestural HCI. Therefore, the objective of this work was to investigate the effect of interface type (e.g., 2D or 3D) on embodied interactions, in the form of 3D gestural HCI. This objective was explored through a user-elicitation gestural study where participants performed traditional HCI tasks with either a 2D or 3D interface. The objective was broken down into two aims: 1) to identify the differences in intuitive gesture choice between 2D and 3D displays and 2) to identify differences in cognitive load between 2D and 3D displays for performing embodied interactions.
Methods
A between-subjects experimental study was performed to study the effect of interface type on embodied interactions. Participants (n = 30) were required to be able to read, write, and speak in English and have full manual dexterity of fingers, wrists, and arms. Each participant was assigned to either the 2D or 3D interface group and performed hand gestures for general HCI functions. The independent variables of this study were function and interface type, and the dependent variables collected were gesture chosen and response time. The study was approved by the Oklahoma State Institutional Review Board (IRB-22-467-STW).
Experimental Setting and Equipment
This study took place in the Human-System Engineering and Applied Statistics Lab at Oklahoma State University. Both the 2D and 3D interfaces were placed in the same room but in different areas (see Figure 1). The 2D interface was on a Dell 22-inch LED monitor connected to a PC running Windows 10. The 2D interface included various colored shapes with a black background (see Figure 2). The 3D interface was a variety of 3D, colored blocks that varied by shape and were placed on a tabletop, as seen in Figure 1. In both interfaces, the participants were not allowed to touch the display and could only gesture in reference to the shapes.

Figure 1. Experimental setup with 2D interface on computer monitors and 3D interface on tabletop.

Figure 2. 2D interface showing objects that are to be manipulated.
Procedure
At the beginning of the experiment, the informed consent process was completed. The participant then completed a demographics survey and a user complacency potential survey. Participants were then introduced to and trained on the gestural technology used in the study. Participants were assigned to an experimental condition (i.e., 2D or 3D) randomly on an alternating basis.
All participants performed functions for general HCI tasks (n = 10). The list of functions is shown in Table 1. The 10 functions were presented across three blocks and were randomized within each block; therefore, each participant saw each function three separate times for a total of 30 gestures performed. During the experiment, the researcher verbally stated the function to the participant, the participant performed a gesture that they believed completed the function, and the researcher then stated the next function.
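The block structure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's actual tooling: the function names are placeholders for the entries in Table 1, and the choice of which condition comes first in the alternating assignment is an assumption, since the source does not state it.

```python
import random

# Placeholder names; the study's actual function list is given in Table 1.
FUNCTIONS = [f"Function {i}" for i in range(1, 11)]
N_BLOCKS = 3


def build_trial_order(functions, n_blocks, seed=None):
    """Return (block, function) pairs, shuffling the functions independently in each block."""
    rng = random.Random(seed)
    trials = []
    for block in range(1, n_blocks + 1):
        order = list(functions)  # copy so the master list is never reordered
        rng.shuffle(order)
        trials.extend((block, fn) for fn in order)
    return trials


def assign_condition(participant_index):
    """Alternate conditions by enrollment order (assumption: even index -> 2D, odd -> 3D)."""
    return "2D" if participant_index % 2 == 0 else "3D"


# One participant's schedule: 10 functions x 3 blocks = 30 trials.
trials = build_trial_order(FUNCTIONS, N_BLOCKS, seed=1)
```

Shuffling within each block, rather than across all 30 trials, guarantees that every function appears exactly once per block, which is what allows reaction times to be compared block by block.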
Table 1. List of HCI functions performed for 2D and 3D interfaces.
For the 2D experimental group, participants sat in front of the monitor and the Intel RealSense camera, and the researcher sat beside them out of view of the camera. The 2D display was viewed via Microsoft PowerPoint with integrated VBA code to record the user’s reaction times in Excel. The researcher read tasks that related to specific shapes on the screen and simultaneously changed the PowerPoint slide to document the reaction time. The gestural video data and a 24-hour clock simulation were recorded via a second monitor. After the last gesture was performed, the researcher stopped the recording of the gestural output and turned off the 2D display.
For the 3D experimental group, the participant completed the preliminary surveys in the same seat as the 2D condition, then stood behind a white table on top of which there were physical, colored blocks. The blocks were the same color, shape, and position as the shapes on the 2D display. Participants faced the same two computer monitors, but only one was used in the 3D condition. The researcher recorded gestural data via a webcam mounted on the Dell monitor, along with the same clock simulator. The researcher faced participants and read the same tasks pertaining to the 3D display. After the gestural data was collected, the researcher turned off the camera and stopped recording.
After completion of the experiment, an informal interview was conducted to gather data on how gestures were chosen and any experiences that may have influenced participants' gesture choice. The participants were then debriefed and given a $10 Amazon gift card in compensation for their time.
Data Collection and Analysis
Two researchers independently analyzed the gestural footage and classified the gestures as done in previous studies (Jurewicz & Neyens, 2017; Jurewicz et al., 2018; Jurewicz & Neyens, 2020). After analyzing the data independently, the researchers compared results and reconciled differences between the two sets; thus, a consensus approach was used to validate the data. The data was converted and the 2D and 3D results were analyzed independently. R version 4.2.1 and the lmer function of the lme4 package (Bates, Mächler, Bolker, & Walker, 2014) were used to perform the analysis. Any internal inconsistencies in the data were removed before determining the intuitive gesture mappings. For each function, the most frequently performed gesture within each experimental group was identified as the gesture mapping in Table 2 (Nielsen et al., 2004).
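The "most frequent gesture" rule can be illustrated with a short Python sketch. The observations below are hypothetical and exist only to show the computation; the labels "point right" and "grasp and move right" are invented for illustration and do not come from the study's data. Ties are resolved here by first occurrence, which is an assumption rather than a rule stated in the source.

```python
from collections import Counter, defaultdict

# Hypothetical coded observations: (condition, function, gesture label).
observations = [
    ("2D", "Move right", "forward hand swipe right"),
    ("2D", "Move right", "forward hand swipe right"),
    ("2D", "Move right", "point right"),
    ("3D", "Move right", "down hand swipe right"),
    ("3D", "Move right", "down hand swipe right"),
    ("3D", "Move right", "grasp and move right"),
]


def gesture_mappings(rows):
    """Map each (condition, function) pair to its most frequent gesture.

    Returns {(condition, function): (gesture, frequency, share of observations)}.
    """
    counts = defaultdict(Counter)
    for condition, function, gesture in rows:
        counts[(condition, function)][gesture] += 1

    mappings = {}
    for key, counter in counts.items():
        gesture, freq = counter.most_common(1)[0]  # ties: first-seen label wins
        mappings[key] = (gesture, freq, freq / sum(counter.values()))
    return mappings


for key, value in gesture_mappings(observations).items():
    print(key, value)
```

Reporting the share alongside the modal gesture makes it visible when a mapping rests on weak agreement, as the paper later notes for the "undo" function.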
Table 2. Gesture Frequency by Display Design.
Reaction time data was collected through VBA code for the 2D display and automatically exported to Excel. For participants who used the 3D interface, the researcher used a standard time-study approach, which was validated by the clock on each participant’s gesture footage. Gesture choice data was collected through video data. The Intel® RealSense™ D435 camera (Intel Corporation, Santa Clara, CA) uses high quality depth per degree information to create 3D representations of the gestures performed (Intel Corporation, 2022). A mixed effects linear regression model was fitted to the data, and stepwise deletion was used to find the best fit model for identifying differences in reaction times. Assumptions of normality, homoscedasticity, linearity, no multicollinearity issues, and no outliers were all checked prior to analysis.
Results
The majority of participants were right-handed (n = 25) and between the ages of 19 and 23, with an average age of 21.3 years (SD = 1.55). Slightly more than half of participants were female (n = 17) and the remainder identified as male (n = 13). About half of the participants had used Microsoft Kinect or another version of virtual reality (n = 16), and 11 reported video game use.
Intuitive Gesture Choice
Intuitive gesture choice was analyzed in both 2D and 3D conditions. A total of 900 gestures were recorded, but after inconsistencies were removed, only 834 were evaluated: 423 for the 3D interface group and 411 for the 2D interface group.
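As a quick arithmetic check of these counts, the following Python sketch recomputes the totals from the design described in the Methods (30 participants, 10 functions, 3 blocks). The variable names are illustrative.

```python
participants = 30
functions = 10
blocks = 3

recorded = participants * functions * blocks  # gestures recorded by design
evaluated_3d, evaluated_2d = 423, 411         # gestures retained per group
evaluated = evaluated_3d + evaluated_2d
removed = recorded - evaluated
removed_share = removed / recorded

print(recorded, evaluated, removed)  # 900 834 66
print(f"{removed_share:.1%}")        # 7.3%
```

Under these figures, 66 gestures, or roughly 7% of those recorded, were excluded as inconsistent before the mappings were determined.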
There were 31 unique gestures performed in the 2D condition and 29 unique gestures performed in the 3D condition. All combinations of function and interface type were mapped to dynamic gestures; there were no mappings to static gestures (e.g., thumbs up) in the data. Overall, the 2D condition tended to have gestures with the palm open forward, and the 3D condition tended to have gestures that incorporated more of a grasping motion, as if the participants were physically trying to grasp the object. Functions 1-4 were directional manipulation tasks and had different mappings between the 2D and 3D conditions. For example, Function 1 (Move to the right) was mapped to “forward hand swipe right” for 2D and “down hand swipe right” for 3D.
Both gestures had a swiping motion to the right; however, they differed in terms of the orientation of the palm in reference to the display: the 2D consensus being parallel to the vertical screen and the 3D consensus being parallel to the horizontal. Functions 2-4 had similar results where the swiping motion was consistent, but the orientation of the palm differed between 2D and 3D conditions.
Functions 5 and 6 were rotational manipulation tasks and followed the same pattern as Functions 1-4 where participants agreed on the direction of rotation but differed in terms of palm orientation.
Function 7 was to “select” and both the 2D and 3D conditions were mapped to “push fingers” as if the participants were physically pushing a button. Function 9 (Switch) also had the same mapping between 2D and 3D conditions with a “two hand switch” gesture. Function 9 was the only function that was mapped to a two-handed gesture.
Function 8 was to “delete” and the 2D group was mapped to “forward hand swipe left” and the 3D group was mapped to “swipe hand left.” The “swipe hand left” gesture was performed with the palm facing to the left with a swiping motion to the left.
Function 10 was to “undo” and the 2D group was mapped to “forward hand rotate left” and the 3D group was mapped to “swipe hand right.” Function 10 had the lowest consensus and the highest variability among all functions.
Reaction Times
The reaction times are summarized in Table 3. The average reaction time for the 2D condition was 4.59 seconds (SD = 1.53) and the 3D condition was 4.95 seconds (SD = 1.91). In both cases, participants’ reaction time was highest in the first block and decreased continuously through block 3. The 2D experimental group’s reaction time in block 1 was 4.76 seconds (SD = 1.42) and decreased to 4.37 seconds (SD = 1.92) by block 3. Similarly, the 3D group’s reaction time in block 1 was 5.61 seconds (SD = 2.59) and decreased in block 3 to 4.22 seconds (SD = 1.09).
Table 3. Reaction Time by Block.
The final mixed effects regression model included function, display type, video game use, age, and gender, with an interaction between function and display type. For this analysis, reaction time was averaged across blocks for each function. There was a significant interaction effect for Function 3: 3D participants took significantly longer to perform Function 3 (i.e., Move up) than 2D participants (p < 0.05). Age also had a statistically significant effect on reaction time (p < 0.05): as age increased, reaction time increased. One of the limitations of this work is the small variation in age of participants, so this effect may be due to confounding between age and dimension.
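One plausible way to write the model described above is given below. This is a sketch under stated assumptions, not the authors' reported specification: the source does not name the random-effects structure, so a random intercept per participant is assumed (the natural choice for repeated measures, and lmer requires at least one random effect), and all symbols are introduced here for illustration.

```latex
\begin{aligned}
\mathrm{RT}_{ij} &= \beta_0 + \alpha_j + \delta D_i + \gamma_j D_i
  + \beta_V V_i + \beta_A \,\mathrm{Age}_i + \beta_G G_i + u_i + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Here RT_{ij} is the block-averaged reaction time of participant i on function j, D_i is 1 for the 3D condition and 0 for 2D, V_i and G_i are indicators for video game use and gender, alpha_j is the effect of function j, and gamma_j is the function-by-display interaction (with alpha_1 = gamma_1 = 0 for the reference function). In lme4 formula syntax this would correspond to something like `rt ~ func * display + videogame + age + gender + (1 | participant)`, where the column names are hypothetical. Under this notation, the reported Function 3 result corresponds to a positive, significant gamma_3.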
Discussion
The objectives of this study were to identify the differences in intuitive gesture choice and cognitive load between 2D and 3D interfaces. It was found that gesture choice differed markedly depending on the interface that participants used. Reaction time differed significantly between the 2D and 3D conditions for one function, and age also significantly influenced reaction time.
Functions 7 and 9 (Select and Switch) had the same mappings for both the 2D and 3D conditions. The directional and rotational manipulation tasks, Functions 1-6, had different mappings; however, the only difference between the 2D and 3D conditions was the orientation of the palm being either vertical or horizontal. The dynamic motion was consistent between conditions for each function. Participants who used the 2D interface manipulated the display with their hand forward, facing the computer screen. Participants who used the 3D interface manipulated the display with their hand down, facing the tabletop. This trend was seen consistently throughout the first 6 functions, and partially in functions 8 and 10 (Delete and Undo).
Participants who used the 3D interface chose gestures for delete and undo that were perpendicular to the display, whereas the 2D condition maintained the same orientation as the prior seven tasks. The gesture mapping for “Undo” for the 2D condition was “forward hand rotate left” and mimicked what an “undo” icon on a traditional computer interface looks like. However, the 3D condition was not mapped to this gesture and was instead mapped to “swipe hand right,” where participants swiped their hand to the right with the palm facing right.
Overall, the functions could be mapped to intuitive gestures, and there appeared to be a consensus in terms of directional movement across conditions for many of the functions. This introduces the idea of defining gestures by the features that make up a gesture rather than defining a mapping by one singular gesture. It has been shown that classifying gestures by their features increases consensus amongst end users and is more effective in classifying gestures for recognition as it avoids semantics and individual differences in gesture choice (Jurewicz & Neyens, 2022). Additionally, using the feature extraction approach can decrease the variability of intuitive gesture choice. For example, Function 10 was “Undo the previous action,” which led to high variability in gesture choice amongst participants. For Function 10, the 2D interface group performed 25 unique gestures, and the most frequent of these was performed 6 times. If a bottom-up, feature extraction approach is used, then we can extract intuitive features, such as whether the movement is dynamic and the direction of movement. It is impossible to design a system that predicts each user’s expectations, but by using a feature extraction approach designers can increase consensus for responses on highly variable tasks.
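The effect of scoring agreement on features rather than on whole gestures can be sketched as follows. Both the feature coding scheme (palm orientation, motion, direction) and the six responses are hypothetical and chosen only to illustrate the idea; they are not the study's coding scheme or data. Consensus is taken here simply as the share of responses matching the most frequent value.

```python
from collections import Counter

# Hypothetical feature coding: gesture label -> (palm orientation, motion, direction).
FEATURES = {
    "forward hand swipe right": ("forward", "swipe", "right"),
    "down hand swipe right": ("down", "swipe", "right"),
    "swipe hand right": ("side", "swipe", "right"),
    "forward hand rotate left": ("forward", "rotate", "left"),
}


def consensus(values):
    """Share of observations that match the most frequent value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


# Hypothetical responses to a single function from six participants.
responses = [
    "forward hand swipe right",
    "forward hand swipe right",
    "down hand swipe right",
    "down hand swipe right",
    "swipe hand right",
    "forward hand rotate left",
]

whole_gesture = consensus(responses)                         # agreement on the full label
motion = consensus([FEATURES[r][1] for r in responses])      # agreement on motion type
direction = consensus([FEATURES[r][2] for r in responses])   # agreement on direction

print(f"whole gesture: {whole_gesture:.2f}, motion: {motion:.2f}, direction: {direction:.2f}")
```

In this toy example only a third of the responses share the same full label, while five of six agree on the motion and on the direction, which mirrors the pattern reported for Functions 1-6, where the motion was consistent and only the palm orientation differed.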
There were few differences in reaction time between 2D and 3D interfaces. The literature is inconsistent in terms of which display type improves performance, 2D or 3D (Bueckle et al., 2021; St. John et al., 2001), and this work shows that there may be little difference between the two interface designs in terms of cognitive load. Function 3 (Move up) took significantly longer in the 3D condition than in the 2D condition; therefore, this function may have a higher cognitive load for 3D interfaces vs. 2D interfaces. Age was also found to influence reaction time, but this effect may be due to confounding between age and dimension as a result of participant demographic limitations.
The main finding of this work is that 2D and 3D interfaces have different intuitive gesture mappings and more work needs to be done to determine how to best design an interface for 3D gestural HCI. One limitation of this study is the limited sample size. A larger and more diverse sample population may identify differences that were not found in this study. Another limitation of the experiment is the low fidelity 3D display. Future work should investigate higher fidelity displays as well as expand the type of display (e.g., 3D interaction on a 2D surface). Future work should also investigate 2D and 3D displays within specific contexts and domains as previous research has shown that these factors significantly influence gesture choice (Jurewicz & Neyens, 2020). Furthermore, there are several aspects of interface design that should be studied, such as the relative position of objects on an interface. It is possible that swiping motions are influenced by the left, right, up, or down justification of objects.
This work contributes to the literature by showing the influence of interface type, 2D or 3D, on intuitive gesture choice and reaction time. There is little work investigating how to design the interface for 3D gestural HCI or eHCI, in general. If it is now possible to use our bodies to interact with computer systems in an embodied manner, the interface must be intuitive and easy to use and design principles need to be developed for ensuring effective eHCI. The use of bottom-up feature extraction in recognizing gesture taxonomy increases usability and decreases variability in the user’s gesture choice; therefore, future work should explore defining gestures by the intuitive features. These findings will benefit interface design in 3D gestural systems and understanding eHCI systems overall.
