Abstract
This paper presents our framework for human robot interaction. The framework has been designed to allow non-experts to control and program complex robots in an intuitive way that is both reliable and accurate. The principal mechanism behind our model is Augmented Reality (AR), which we use to provide a diagrammatic service to the user. The service uses a range of diagrammatic markers, including marker-less AR objects, which can be created, connected together and used to command and control the robot and to engage in two-way communication. We report on two case studies in which our framework is applied to a set of command and control tasks, and we measure performance in terms of situation awareness, task completion time and cognitive load. Our results show that our model leads to greater situational awareness and improved task completion times when compared to conventional interaction methods, such as a gamepad controller. Future work will integrate our model into multi-modal hybrids and extend the case studies to compare against other interaction methods.
Introduction
The rise of the machines is underway, not least in the area of service robots in and around home, work and social settings. A recent survey [8] indicates a clear trend of robots establishing themselves more and more in our everyday lives. In 2012 some 3 million service robots were sold for personal and domestic use, an increase of 20% from 2011. This breaks down to an increase of 53% in domestic robots and, interestingly, of 29% in entertainment robots, which indicates that robots are set to become a mainstream device for entertainment, perhaps eventually overtaking mainstays such as the X-Box and Sony PlayStation, or at least providing serious competition.
Robots are influencing our everyday lives, and will continue to do so in more and unexpected ways in the future, in our homes, schools, hospitals, offices and social spaces. This will bring sophisticated robots into contact with people who are unlikely to have specialist knowledge of robots, computers or technology in general. How would an average medical doctor, school teacher, nurse, parent or child interact with these machines in a natural and unambiguous way? This presents a major challenge for the field of human robot interaction (HRI). We need to develop effective means for ordinary people to interact and work with ever more complex robots, reducing any experience gaps to make the man-machine collaboration intuitive, seamless, unambiguous and even enjoyable. Of course the next question is, how?
Factors affecting HRI
There are a number of modalities we can use to build an HRI layer; these are broadly categorised in Fig. 1 below. We have illustrated modalities across a spectrum of expressiveness and ambiguity, with speech at one end of the spectrum and tactile at the other. We believe that the ideal HRI experience can be achieved by a balanced multimodal approach.

Examples of different HRI modalities, including the concept of combining modalities.
A successful HRI mode or multi-modal system should consider how a human processes information and uses that information to decide how to interact with objects. Cognitive psychologists have modelled the process of human information processing as a series of individual components. Two widely accepted models are those of Wickens and Hollands [4] and Parasuraman [13]. Based on their work four key factors emerge as affecting a successful human robot interaction, illustrated in Fig. 2.

A model of human information processing.
Of these four stages, Prewett [14] shows that the Perception and Response stages are the most responsible for generating high levels of task demand in a human robot interaction. This leads Prewett to the conclusion that better HRI systems can be built by reducing the perceptual and response demands of the interaction, so reducing the human’s overall workload during their HRI experience [14].
Prewett decomposes the perceptual demands of an HRI system into six factors, namely frame rate, response delay, field of vision, camera perspective, depth cues, and environmental detail, while response demands are affected by two factors, task performance standards and the number of robot platforms. This implies that HRI systems with good visual display features reduce perceptual demands and systems with relevant levels of automation reduce response demands [14].
The work of Wickens and Hollands [4] complements this view and suggests that augmented reality could be a good mechanism for balancing a human’s attention between the situated environment and the HRI interface, since the interface would effectively be embedded into the scene. In their own words,
Augmented reality may be useful for controlling remote vehicles, or other tele-operation tasks, such as placing an object in a certain location by controlling a robotic arm [4].
This leads to the motivation for our framework, which makes use of Augmented Reality to build a service allowing humans to instruct and communicate with robots using diagrams. We extend the ideas hinted at by Wickens and Hollands beyond the notion of AR objects as simple way-points, treating them as informational objects that can convey actions, messages and meaning. The AR objects can be linked together into a topology to form an informative diagram embedded into the environment, which can be read and modified not only by the human and the robot, but also by other robots and other humans. Our vision is similar to the idea of visual block-based programming languages which instruct computers. In our framework, the program is instead embedded into the environment using AR and is used to instruct a robot.
Augmented Reality (AR) refers to the representation of virtual graphics objects on top of a real-world scene [6]. The aim of AR is to deliver the sensation that virtual objects are present in the real world in the same way that physical objects are. Sophisticated computer vision algorithms have been developed to achieve this effect, and they have become quite convincing at overlaying virtual 3D objects on a real-time video stream so that they appear to belong to the scene displayed by the camera. The AR rendering process can be thought of as a problem of tracking the camera pose (position and orientation) with regard to the observed scene structure [7]. There are generally two categories of AR algorithms. The first category, known as marker-based AR, relies on tracking known object markers which have special features that enable the camera pose to be recovered. The second category, marker-less AR, relies on the identification of naturally available optical features in a scene, such as edges, corners or special points of interest, to compute the camera pose at frame rate. This is computationally expensive, as these features are analysed and tracked iteratively from frame to frame. With the camera pose recovered, a graphics rendering engine (for example OpenGL [16]) can be used to embed AR objects into the scene structure, giving the illusion of physical existence. We demonstrate a typical AR application in the online video footage here.1
Our framework uses AR as a mechanism to facilitate human-robot interaction and is explained in more detail in our earlier works [12] and [11]. For completeness we present a summary of this work below, framed within a new conceptual point of view. We can view AR as a mechanism that uses two-dimensional diagrammatic objects such as barcodes, QR codes and fiducial markers to augment a given physical space with virtual 3D objects, as illustrated in Fig. 3. We can also extend this to physical objects, which are recognized and treated as markers and used to overlay the real with the virtual.

Examples of AR diagrammatic objects and their AR rendering including the idea of using physical object recognition.
Marker-based AR does have a limitation: the environment needs to be instrumented with these object markers. Object recognition has its own problems, since recognizing objects reliably across varying viewpoints, especially well enough to recover orientation, is not easy. To address these problems marker-less AR was developed, which tracks low-level features as outlined above.
Our framework uses all of these AR concepts to build our core HRI user diagram service. Diagrams are very useful tools for humans and they enable an intuitive form of communication across different social groups. They further allow the exchange of information across different geographic, language and cultural boundaries. In a broader sense, diagrams can be viewed as a mechanism for referencing physical space, and can be used to instrument space with markers, instructions and messages, forming the basis of communication. While we have developed a singular form of HRI mechanism, we recognize that as with human interaction, a successful robot HRI is likely to be multi-modal in nature. Our framework model is outlined in Fig. 4.

The Spatial Human Robot Interaction Marker Platform (SHRIMP). The framework uses AR diagrammatic objects to form a diagram HRI service. Users can place objects into the environment, and these objects can convey instructions and messages in either direction, from the human to the robot or from the robot to the human. Objects can be linked together to form a diagram topology, which resembles an intuitive graphical programming framework.
From a high-level view, our prototype SHRIMP framework can be seen as a combination of three components, namely PTAMM [5], our Linear Transformation algorithm and ROS (Robot Operating System). PTAMM is arguably the current state-of-the-art in camera pose-tracking for marker-less augmented reality.2
Google have since announced their Project Tango device. Tango integrates a dedicated Inertial Measurement Unit, an RGB camera and Time of Flight sensors. This is the first time a robust camera pose-tracking model has been possible on a commercial mobile device, and we will seek to use this technology in the next iteration of our framework.

Our PTAMM Linear Transformation function, which tracks local map frames and dynamically creates local transformations.
In our extended PTAMM model, the camera’s initial location is taken as the origin of the global frame of reference (C1), and subsequent camera poses are expressed with regard to this global frame. When PTAMM generates a new local coordinate system, the local pose is captured and converted into the global frame using our Linear Transformations. This conversion can be expressed with the following equation,
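As a rough illustration of such a local-to-global conversion, and not necessarily the authors' exact formulation, the sketch below assumes that every pose is stored as a 4x4 homogeneous transform and that the pose of a local map's origin in the global frame C1 is known. The function names, the choice of Y as the vertical axis and the numeric values are all hypothetical.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(yaw_rad, tx, ty, tz):
    """Homogeneous transform: rotation about the Y axis plus a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, 0.0, s, tx],
            [0.0, 1.0, 0.0, ty],
            [-s, 0.0, c, tz],
            [0.0, 0.0, 0.0, 1.0]]


def local_to_global(t_global_local, t_local_camera):
    """Express a camera pose from a local map frame in the global frame."""
    return matmul4(t_global_local, t_local_camera)


# Hypothetical values: the local map origin sits 2 units along X in the
# global frame, and the camera is 1 unit along Z inside that local map.
t_global_local = make_pose(0.0, 2.0, 0.0, 0.0)
t_local_camera = make_pose(0.0, 0.0, 0.0, 1.0)

t_global_camera = local_to_global(t_global_local, t_local_camera)
camera_position = [row[3] for row in t_global_camera[:3]]
```

Chaining transforms in this way means that a pose captured in any newly created local map can be re-expressed in the single global frame by one matrix product.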
Finally, our framework is built on top of the Robot Operating System (ROS). ROS is an open-source software library that provides device abstraction, low-level device control and inter-process communication services. Together these provide a runtime abstraction layer that enables portable cross-platform robot development across multiple vendors. Figure 6 illustrates the PTAMM tracking modules at the heart of the SHRIMP framework.

PTAMM component hierarchy within the SHRIMP framework.
To test the core functionality of our SHRIMP framework, we initially implement its core set of services: our camera pose tracking service (extended PTAMM for AR tracking), our diagram placement service (AR command objects) and our command dispatcher (diagram execution).
We implement our extended version of PTAMM as a service within the Robot Operating System (ROS) via a custom-built software module. Our framework runs on a desktop PC with an Intel Core i7 CPU, running Ubuntu Linux 12.04 and ROS. It is configured with 16 GB of RAM and a 1 GB NVIDIA graphics card with CUDA support. Our platform is implemented in C++ with the TooN, libCVD and OpenGL software libraries. Video frames are captured with a standard 1.3 MP USB webcam, which was the primary sensor throughout all of the experiments reported here.
The core service within SHRIMP is the motion tracking model implemented through our extended PTAMM. This allows persistent placement of marker-less AR objects, which is one of our framework’s key features. We implement extended PTAMM as a real-time ROS service which currently runs in client-server mode using the ROS communication framework. This allows us to host the heavy processing on the connected i7 desktop. Our set-up is illustrated in Fig. 7.
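To make the idea of diagram placement and dispatch concrete, the following is a minimal sketch of how linked AR command objects could be represented and executed in order. It is an assumption for illustration only: the class `ARObject`, the functions `link` and `dispatch`, and the task labels are hypothetical and are not taken from the SHRIMP implementation, which is written in C++ on ROS.

```python
class ARObject:
    """One diagram node: a task anchored at a position in the global frame."""

    def __init__(self, kind, position):
        self.kind = kind          # hypothetical labels: "waypoint" or "grip"
        self.position = position  # (x, y, z) in framework units
        self.next = None          # following node in the diagram, if any


def link(objects):
    """Chain the objects in order and return the first node (or None)."""
    for current, following in zip(objects, objects[1:]):
        current.next = following
    return objects[0] if objects else None


def dispatch(start, send):
    """Walk the diagram from `start`, passing each task to the `send` callback."""
    count = 0
    node = start
    while node is not None:
        send(node.kind, node.position)
        node = node.next
        count += 1
    return count


sent = []  # stands in for the robot side in this sketch

diagram = link([
    ARObject("waypoint", (0.0, 0.0, 1.0)),
    ARObject("waypoint", (1.0, 0.0, 1.0)),
    ARObject("grip", (1.0, 0.0, 1.5)),
])
dispatched = dispatch(diagram, lambda kind, position: sent.append((kind, position)))
```

Keeping the dispatcher independent of the transport (here a plain callback) mirrors the separation between diagram placement and command dispatch described above; in a ROS-based system the callback would presumably publish each task as a message.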

SHRIMP’s communication architecture.
We use Eddie, the Microsoft reference robot from Parallax, as our first platform under ROS. We developed a custom ROS driver for the robot; sensor data is streamed to our desktop PC via ROS, and the live video feed is provided by a direct USB connection to the on-board camera. Figure 8 illustrates our first experimental run and shows our SHRIMP service running on top of our extended PTAMM model. Here a user is placing a single AR object within the robot’s immediate scene. The object in this case represents a navigational task for the robot to follow, in the form of a navigational way-point. In our prototype SHRIMP model, AR object placement is aided by a user interface with a special set of button controls. These button controls permit the user to translate AR objects along all three spatial axes, in six directions (forward, backward, left, right, up and down). Future developments of the interface will aim at allowing placement with a mobile tablet or phone device.
The two top-most images in Fig. 8 show the robot’s start location in the real world and in the framework’s global map. The next two images show the placement of the AR navigation object into the robot’s environmental scene. In more detail, the user is placing the AR navigation object (the blue sphere) 1 meter ahead of the robot in the XZ plane (1 meter is equal to one unit in our framework set-up) using the on-screen controls shown in the top-left of the screen. The final image in the set shows the robot arriving at the defined destination in the real world after navigating to the placed AR object. The run is demonstrated in the online video footage here.4
Further experiments were carried out with the framework to test its capability for navigating through complex paths. For these experiments we used a LEGO Mindstorms NXT robot running the ROS framework.

Experimental run with Parallax robot.
Our SHRIMP implementation currently allows the placement of navigational tasks (way-points) and action tasks (gripping objects) which can be connected together to form a task diagram embedded into the real world environment. This is demonstrated in the online video links (
In this section we present the outcomes of two case studies designed to evaluate how our framework affects performance in a series of human robot interaction tasks involving navigation. The first case study focuses on the users’ subjective viewpoint through a survey, whereas the second case study takes a more objective approach and uses electroencephalogram (EEG) data to evaluate cognitive workload.
User case study 1
Our first case study seeks to validate our approach with a key hypothesis.
Does our framework improve the HRI performance for the average person?
To address this hypothesis a comparative navigation task was set up using the Parallax Eddie robot platform. A group of twenty participants aged between 21 and 50 took part in the experiment.5
All experiments reported in this paper complied with Monash University Ethics requirements.
Each participant only had access to the robot’s camera view and was asked to remotely navigate the robot over a predefined path while observing the scene around the task. Participants completed the task twice, once remotely operating the robot using a PS3 joystick controller and once using our framework. A joystick controller was chosen as the comparison since it is one of the most widely used HRI methods today, and most mobile robot platforms, including Parallax, provide native support for it. Participants were randomly assigned a first HRI method, either ours or the PS3 joystick.
To test the hypothesis, the operators’ performance levels are measured via a post-observation questionnaire in which we asked each participant to recall a set of special environmental features that had been placed within the environment. These special elements were symbolized by a set of fiducial markers positioned randomly across the environment. We chose to use fiducial markers because to a human they appear as abstract shapes and require cognitive effort to observe and recall with any level of precision. For each participant, the special elements and their respective positions were changed between test runs, to prevent persistence of memory affecting the outcomes. In addition to the questionnaire, we also captured the task completion times for each case [17]. The robot area was physically separated from the participants with a blocking partition, which prevented the participants from directly viewing the scene. They could only observe the robot’s environment through the live video feed sent back to the control station. The experimental set-up for this case study is illustrated in Fig. 9.

Case study 1 – experimental set up.
Raw dataset for case study-1
Considering the task completion time, Table 1 at first glance indicates that the average task completion time with our model framework (

Scatter plot of task completion time with SHRIMP.

Scatter plot of task completion time with PS3.
This implies that completion times are more consistent with our model, and tend to be lower when compared to the standard PS3 method of control. This is more clearly seen in the normal distributions of task completion times in Fig. 12.

Normal distribution of task completion times.
We can explain away the evidence of the more robust
Situation awareness (SA) is the second key HRI metric evaluated. As discussed earlier, the number of successful special marker recalls reflects the level of SA possessed by the participant. Figures 13 and 14 show the distribution of SA for both models.

Scatter plot of situation awareness with SHRIMP.

Scatter plot of situation awareness with PS3.
Further distinctions are shown in Fig. 15 highlighting normal distributions for each condition.

Normal distribution of SA levels.
The average SA level for SHRIMP (
In summary, this case study has demonstrated that our diagrammatic HRI service has a significant positive impact on producing consistently lower task completion times and also has a significant impact on situational awareness. Both results indicate a positive outcome for our original hypothesis: our model does improve HRI performance for the average person.
In our second case study we assess another key HRI factor, cognitive workload. Studies indicate [17] that cognitive workload is inversely related to task performance: as cognitive workload increases, task performance decreases, and vice versa. In the context of our framework we expect that our model will require less cognitive load than the compared HRI method, the PS3 controller. We carry out this case study by collecting real-time objective data from the participants, using standard EEG data collection methods to assess their mental state while they actively perform a navigation task with both HRI interfaces.
Analysis of cognitive workload
We use standard EEG techniques for monitoring the electrical activity in participants’ brains in real-time, which gives some indication of the cognitive state of the participants as they carry out their assigned task. According to EEG theory, recorded electrical signals can be categorized into distinct frequency bands, namely Delta (less than 4 Hz), Theta (4–8 Hz), Alpha (8–12 Hz), Beta (13–31 Hz) and Gamma (greater than 32 Hz), and we can use these for the purpose of assessing cognitive workload.
There are two primary factors related to cognitive workload, namely memory capacity and mental effort [2,3]. Existing psychological experiments show that lower Theta activity (4–8 Hz) relates to low memory demands, while higher Alpha activity (8–12 Hz) relates to lower mental effort [1,3,9,10]. Based on these observations we ground our second case study on the following hypothesis.
If average Theta activity of SHRIMP is less than PS3, AND the average Alpha activity of SHRIMP is greater than that of PS3, then SHRIMP has less cognitive workload than PS3.
In this case study ten healthy participants aged between 24 and 30 years took part. Each participant went through an initial learning phase to familiarise themselves with the SHRIMP and PS3 HRI interfaces. Our robot client in this experiment was a LEGO Mindstorms NXT, and the camera was fixed externally to the robot so that the experimental area was in full view. Each participant was tasked with navigating the robot along a non-linear obstacle course to a target location, which was indicated by a solid red marker. The experimental set-up is shown in Fig. 16. In this set-up, with SHRIMP, multiple AR objects have to be placed and linked together to form a diagrammatic navigational path leading from the robot’s starting position to the target location (red spot). The participants completed the task twice, once using the SHRIMP framework and once with a PS3 gamepad. Again, the order in which this took place was randomised, and participants completed the task remotely using only the camera view, as they did not have direct access to the environmental scene.

The robot’s view of the environment. The red object marks the end-point of the navigation path.
EEG data was recorded using a 64-channel wired sensor array, and the raw data was saved to a binary file indexed by the participant’s ID. Control data was also collected from each participant, who was asked to sit in a relaxed state and observe the camera view at the control console without performing any action. This control was then treated as the participant’s neutral cognitive state. EEG measurements were taken three times for each participant for each HRI interface, SHRIMP and PS3. Following recommended EEG practice, the three samples for each interface for each participant are averaged to reduce the inherent noise in the data channels, minimising the potential for spurious data to introduce errors.

Obtaining EEG measurements from a subject while operating a robot.
Each EEG channel was sampled at a rate of 250 Hz, and all signal processing was done using Matlab and the open-source EEGLab Matlab toolbox. The recorded digitized EEG data were filtered by a 1 Hz high-pass and a 30 Hz low-pass filter, followed by an artefact (noise) rejection routine. For the purpose of artefact rejection, we manually examined the time domain for unusual spikes and excursions that relate to environmental noise. Figure 17 shows our experimental set-up at the console with the EEG sensor cap. The EEG literature [1,3,15,18] indicates that specific channels relate to cognitive workload, namely EEG channels F3, F4, Fz, C3, C4, P3, P4, Pz and Oz. These are the channels we use in our analysis.
To visualize the cognitive state variations between the SHRIMP, PS3 and Control conditions, we plotted the power spectra from the channels as a Power Spectral Density (PSD) graph. To do this we took the 90 EEG measurements obtained from the 10 participants (three measurements of each of the three conditions, SHRIMP, PS3 and Control, for each participant). We averaged together the three runs of each individual participant’s trial and computed a power spectrum using a fast Fourier transform. The results are illustrated in Fig. 18.

Grand average of EEG activity over the frequency spectrum 1–30 Hz.
To compare the power spectra between the test conditions (Control, SHRIMP and PS3) we calculate the area under each curve in the power spectral density graph. Since our interest lies only in the Theta and Alpha frequency bands, we calculate the power densities only for these bands. Figures 19 and 20 show the PSD graphs for the Theta and Alpha bands respectively. Theta and Alpha power densities are summarised in Tables 2 and 3 below.
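The band-power calculation described here can be sketched as follows. This is a simplified stand-in rather than the authors' Matlab and EEGLab pipeline: it uses a naive discrete Fourier transform from the Python standard library, a trapezoidal area under the spectrum, and a synthetic two-tone signal instead of recorded EEG. The function names and the test signal are assumptions made for illustration; only the 250 Hz sampling rate and the band limits come from the text.

```python
import cmath
import math


def power_spectrum(signal, fs):
    """One-sided power spectrum of a real signal using a naive DFT."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_power(freqs, power, low, high):
    """Trapezoidal area under the spectrum for low <= f <= high."""
    points = [(f, p) for f, p in zip(freqs, power) if low <= f <= high]
    return sum((f1 - f0) * (p0 + p1) / 2.0
               for (f0, p0), (f1, p1) in zip(points, points[1:]))


fs = 250          # sampling rate in Hz, as stated in the text
n_samples = 500   # two seconds of synthetic data (assumed)

# Synthetic stand-in for one channel: a 6 Hz (Theta) tone plus a weaker
# 10 Hz (Alpha) tone. Real EEG would replace this list.
signal = [math.sin(2 * math.pi * 6 * t / fs)
          + 0.5 * math.sin(2 * math.pi * 10 * t / fs)
          for t in range(n_samples)]

freqs, power = power_spectrum(signal, fs)
theta_power = band_power(freqs, power, 4.0, 8.0)
alpha_power = band_power(freqs, power, 8.0, 12.0)
```

For recordings of realistic length a fast Fourier transform from a numerical library would replace the naive DFT; the band integration step stays the same.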

Overall EEG activity in Theta band (4–8 Hz).

Overall EEG activity in Alpha band (8–12 Hz).
EEG spectral power in Theta band
EEG spectral power in Alpha band
We analyse the data for statistical differences with an ANOVA (Analysis of Variance). The ANOVA test is carried out on both the Theta and Alpha bands to test the significance of the differences in spectral power between the three test scenarios (Table 2). To do this we must first calculate the individual spectral power values for each participant for each test scenario for both Theta and Alpha; the results are summarized in Tables 4 and 5 below. The ANOVA test for Theta yields a p-value of 0.34 (p > 0.05), which indicates that there are no statistically significant differences between the average spectral power values of SHRIMP, PS3 and Control. In other words, average Theta activity does not vary significantly among SHRIMP, PS3 and Control.
Individual Theta spectral power values
Individual Alpha spectral power values
t-test summary of Alpha activity variations
The ANOVA test for Alpha yields a p-value of 0.000291 (p < 0.05), indicating that there are statistically significant differences in Alpha spectral power values among the groups. However, this does not tell us which pairings are responsible for the differences. To find out, we use a two-tailed t-test, which means that we do not need to make any assumption about the direction of the difference between the two sets of values under test.
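The two statistics used in this analysis can be sketched in plain Python as below. The sketch computes only the test statistics and degrees of freedom; converting them to p-values needs the F and t distributions, for which a statistics library (for example SciPy's `f_oneway` and `ttest_ind`) would normally be used. The data values are invented for illustration and are not the study's measurements, and the choice of an independent-samples, pooled-variance t-test is an assumption, since the text does not say which variant was applied.

```python
import math


def mean(values):
    return sum(values) / len(values)


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def pooled_t(a, b):
    """Return (t, df) for an independent two-sample t-test with pooled variance."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    df = len(a) + len(b) - 2
    pooled_var = (ss_a + ss_b) / df
    t_stat = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return t_stat, df


# Invented Alpha band powers, three values per condition (illustration only).
control = [5.0, 6.0, 7.0]
shrimp = [2.0, 3.0, 4.0]
ps3 = [2.0, 3.0, 4.0]

f_stat, df_between, df_within = one_way_anova_f([control, shrimp, ps3])
t_control_shrimp, t_df = pooled_t(control, shrimp)
t_shrimp_ps3, _ = pooled_t(shrimp, ps3)
```

With these invented numbers the pattern resembles the one reported: the omnibus statistic is large, the Control versus SHRIMP comparison is large, and the SHRIMP versus PS3 comparison is zero.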

A summary EEG scalp map at 6.05 Hz, the mid-point for Theta activity.

A summary EEG scalp map at 10.08 Hz, the mid-point for Alpha activity.
The results of the two-tailed t-tests between SHRIMP, PS3 and Control are given in Table 6. They show that the only significant differences are between the Control scenario and each of the SHRIMP and PS3 scenarios (p < 0.05). There is no significant difference between our model, SHRIMP, and the standard PS3 HRI method (p > 0.05). These results are visualised in the summary power spectral density scalp maps shown in Figs 21 and 22 for Theta and Alpha activity respectively, where the differences are visually quite clear. In Fig. 21, blue indicates lower Theta levels (better), which appear in the control scenario rather than in either SHRIMP or PS3. The Alpha band tells a similar story (Fig. 22), where red indicates higher Alpha levels (better) for the control scenario.
So what do these results say about our model, and where does this leave our hypothesis for this case study? One potential criticism, of course, is that our sample size was too small and that we did not leave open the option of repeating the study with a larger sample. Even so, we were expecting to observe some differences. If we take the results at face value, then there are no significant differences in cognitive workload between our model and the PS3 HRI model. Critically though, there are significant differences in cognitive workload between both HRI models and the control scenario, where participants were simply observing the workstation area and engaging in no activity. This at least indicates that our experimental set-up is able to detect differences and that these differences follow intuitive expectation: a resting state has a lower cognitive workload than an active state, and this difference is real and measurable. But what about our hypothesis, and how do we explain our results? One explanation is that our set-up does not have the granularity to distinguish different types of cognitive workload. So while SHRIMP and PS3 produce the same levels of cognitive workload, the nature of the workload is likely to be quite different in each case. We can use the evidence gathered from our first case study to back this up. We saw that SHRIMP produced better situational awareness: participants were able to observe the scene in detail while the robot autonomously followed their diagrammatically placed instructions, whereas in the PS3 scenario they were caught in a tight control loop requiring all of their attention to operate the robot towards its goal.
So we infer from this that while the cognitive load is the same in both cases, participants are applying themselves to two very different cognitive tasks: with SHRIMP, productively, to increase their situational awareness, and with PS3, functionally, simply to complete the task, where situational awareness suffers. To test this point we should re-design the case study and ask participants to close their eyes and rest when they think the goal has been achieved. We would expect participants using SHRIMP to be able to do this as soon as they have completed their task diagram, while participants using the PS3 would carry a cognitive load controlling the robot until it had reached the target. In such a study we would expect our earlier hypothesis to be validated.
This paper presented our framework for human robot interaction which is based on the idea of using diagrams as a visual form of communication. The framework has been designed to allow non-experts to control and program complex robots in an intuitive way that is both reliable and accurate. We used modified Augmented Reality methods to implement our framework as an HRI service to participants in two case studies. We compared our model with a standard HRI method which is joystick based.
Future work will extend our implementation to multiple users and multiple mobile devices, and will develop the vocabulary of our diagram objects to allow for a broader interaction experience. Google’s Project Tango device is also likely to play a significant role in our framework’s future development.
