Tidy up my room: Multi-agent cooperation for service tasks in smart environments

Abstract

Low-cost robots are usually specialized systems that cannot solve complex tasks, e.g., doing the laundry or tidying up. These tasks are usually solved by more complex and expensive general-purpose systems. The most common limitations of low-cost robots are a lack of sensors and actuators or insufficient computing power to perform multiple functions in parallel. Integrating robots into intelligent environments can help to solve more complex tasks by utilizing the components of the intelligent environment. This approach is used in the presented system to detect pointing gestures from a user and to locate objects that a low-priced robot then collects and carries away. Detection and localization are performed by cameras and supported by lights of the intelligent environment. The experimental results show that the cooperation of robots and smart environment increases the success rate of complex tasks in situations where the robot or components of the intelligent environment alone would underperform.

Keywords

Multi-agent cooperation, service tasks, tidying-up system

1. Introduction

Robots are becoming more and more applicable in everyday life to relieve people of various tasks or to collaborate with them. They are a great support, especially for people who are unable to perform tasks independently or at all. A problem for robotics is the range and complexity of tasks. Simple tasks, such as opening doors [14], vacuuming [30] or handing over objects [21], can be performed by simple and specialized robots that have been designed and developed for this purpose. So far, more complex tasks have only been solved in research with expensive general-purpose robot systems. In most cases, these systems have extended and cost-intensive sensor and actuator technology compared to practice-oriented simple robot systems. An alternative is the extension of simple robots by connecting several heterogeneous devices, sensors and actuators of a smart environment and using them to sense and change the state of the environment. Smart environments aim to increase comfort, economy, security and other daily human life factors [3] and can also be a supportive pillar for robot systems. This leads to a powerful system that is able to solve more complex tasks efficiently.

Fig. 1.

Illustration of the laboratory environment used for both evaluations. The three humans represent the three locations from where the user points at the different object locations. The three colored areas represent the fields of view of the ceiling-mounted smart cameras. Smart lights are used to illuminate the scene and give visual feedback to the user. The different points of view of the same environment were chosen to visualize the various components of the system.

In order to demonstrate the effectiveness of this approach, a use case with high practical relevance for users was implemented, namely the instruction of a mobile service robot to tidy up a room. The application consists of several capabilities: (1) an intuitive interaction method based on pointing gestures, (2) cooperative task execution mechanisms for working in different rooms and (3) a situation-aware feedback system.

Thus, a cooperative system consisting of a robot and an intelligent environment is proposed to make the system more effective. In the scenario, which is shown in Fig. 1, different cameras perform the object detection and localization as well as the recognition of human gestures. Smart lamps in the ceiling support the cameras by illuminating a suitable area. Colored lamps serve as a feedback mechanism revealing the status of the system to the user. In addition, a mobile robot performs the physical manipulation of the environment.

The system described here builds on the results of previous work [25], shown in Fig. 1(a), which comprised only different smart cameras. In this work, the system is extended by several additional components, which are described below. The contribution of this paper is the integration and evaluation of these additional components to expand the original idea and enable novel use cases. The new composition of the entire system is illustrated in Fig. 1(b). The original system is extended by a robot with a gripper arm that can pick up objects. In addition, the markers that were used initially as objects are replaced by real objects, and several controllable lights are installed and integrated into the system as physical agents. The lights are used to illuminate the scene such that the images of all cameras are bright enough for the detection of objects and the recognition of human gestures. Furthermore, they are used as a channel for interaction between the system and the user. For this purpose, the colored light approach from previous research [19] is used. In order to show the effectiveness of the approach, the results of an extended evaluation concerning the success rate of the tidying-up system and the influence of the proposed lighting system are provided.

This paper is structured as follows: first, an overview of related work concerning tidying-up systems, ubiquitous robots, pointing gesture detection and its applications is given. Afterwards, a cooperative smart environment is introduced, which is evaluated in the following sections in terms of the accuracy and success rate of the approach. Finally, the work is concluded and future work is presented.

2. Related work

Fig. 2.

Overview of the cooperative behavior of the tidying-up system.

While the general idea of tidying-up robots is attractive for users [4], the applied research on this topic is limited. Abdo et al. [1] present a general-purpose robot that puts away different types of objects based on the prediction of user preferences. Yamazaki et al. [31,32] present a tidying-up robot system based on the cost-intensive PR2.¹

¹ PR2 (Personal Robot 2) is a robotic platform developed by Willow Garage – http://www.willowgarage.com.

The robot can carry a tray, collect clothes and put them into the washer, or sweep the floor. Hornung et al. [11] use the NAO, a humanoid robot, to collect items scattered on the floor, move obstacles out of the robot’s way, and put the objects at specified target locations.

Integrating robots into intelligent environments is a cost-effective alternative to the previously mentioned expensive general-purpose robot systems. There is no unified term for this integration but some terms that are closely related to each other. Kim et al. [12] introduce the notion of ubiquitous robots that deals with the embedding of robots into a ubiquitous space. This space features a high degree of connectivity between several heterogeneous components classified as software components, embedded components and mobile components like mobile robots. The term physically embedded intelligent systems (PEIS) coined by Saffiotti et al. [22] summarizes the vision of integrating different devices, e.g., smart cameras and mobile robots, into a joint space opening the opportunity for more advanced robot applications. Nor and Mizukawa [16] propose a similar system called Kukanchi that is an intelligent space for home-based robotic services. Another term is used by Pyo et al. [20] who developed an architecture for an informationally structured environment. The environment consists of sensors embedded into the environment, which continuously monitor and provide information, such as the position of objects, to other agents in the environment, e.g., mobile robots.

None of these approaches, except for a preliminary work in the Kukanchi project [17], allows non-verbal interaction with the system employing human pointing gestures. However, since the interaction is intuitive, natural and does not require additional devices, human pointing gestures are extensively applied in human-robot interaction systems. For example, Tölgyessy et al. [26] use human pointing gestures to control an autonomous mobile robot equipped with an RGB-D sensor. Furthermore, they evaluated different pointing vectors and found that the pointing vector described by the elbow-wrist line is more accurate than the shoulder-wrist and wrist-palm lines. Van den Bergh et al. [28] developed a real-time hand pointing gesture detection system for interaction with a robot. The pointing direction indicates a navigation goal for the mobile robot. They use an RGB-D sensor mounted on the mobile robot for the hand posture recognition. A similar approach is proposed by Droeschel et al. [6] who present a method for pointing gesture recognition using a Time-of-Flight camera mounted on a domestic service robot. Additionally, they use a laser scanner to keep track of the interacting person by detecting the person’s legs. Thus, a person does not need to be at a predefined position for interacting with the mobile robot. Yan et al. [33] use a pointing gesture together with verbal interaction to deal with situations where ambiguities about the execution of a task exist, e.g., which object should be grasped. Pateraki et al. [18] use pointing gestures in combination with face poses and a priori information about object positions to estimate pointed targets. Other works in this field of research comprise the visual interpretation of pointing gestures in 3D space [13], the probabilistic detection of pointing gestures [23], the probabilistic optimization of robot pointing gestures [9] and the visual pointing gesture recognition for human-robot interaction [15].
The mentioned works have in common that the gesture-based interaction is directly performed between human and robot. They do not consider any other sensors of an intelligent environment. Thus, they are limited to the field of view of the robot’s on-board camera. Various interfaces are being developed for human interaction with smart environments. Some approaches use verbal methods to query the system or get feedback from the system [10,27]. Other approaches are based on visual data to interact through motion. Varkonyi-Koczy and Tusor [29] use hand-postures to interact with the building, and Abid et al. [2] use dynamic sign language for interactive applications in smart homes. Zhang et al. [34] present an alternative sensor. Their approach uses conductive paint to create a smart wall that can detect gestures and touches as well as the use of electronic devices.

Fig. 3.

Processing pipeline for the human pointing gesture detection task.

3. Cooperative smart environment

In order to address the limitations of simple robots, a cooperative smart environment with the focus on a tidying-up application is proposed. The user interaction is based on human pointing gestures to determine objects to be picked up by a mobile robot and carried away. The gesture-based interaction method for the localization of objects considers the whole intelligent environment as the interaction partner instead of a single camera mounted on a mobile robot or in the environment. The tidying-up system consists of multiple stages as depicted in Fig. 2. Human pointing gesture detection is performed on RGB-D images obtained by RGB-D cameras mounted in the environment. This first step yields a pointing position in 3D space as well as certain human pose characteristics that are used for the calculation of a probabilistic region of interest (ROI) in the next step. The ROI assigns a probability to each position in the environment to be the position pointed at by the user. See Section 3.2 for more details. Based on this information, RGB cameras in the environment are asked to search for and localize objects in their fields of view, where the selection of the cameras depends on the spatial overlap between their fields of view and the probabilistic ROI. Therefore, only cameras that are likely to see the object pointed at by the user are activated. The same applies to the selection of controllable lights to illuminate the area around the pointing position and support the detection process. Afterwards, the position of the object with the highest global probability is determined. Thus, this step refines the initial pointing position. This object position is finally sent to a mobile robot to pick up the object and carry it to its correct location, e.g., a shoe belongs in a shoe rack. An exemplary video of the whole system can be found online at https://youtu.be/xj1QiAkN2fQ.

3.1. Pointing gesture detection

The detection of human pointing gestures is realized using an RGB-D camera as part of the intelligent environment. The purpose of the approach is to allow users to point at certain locations in the environment and to determine the corresponding pointing position on the ground plane. To this end, several image processing and geometry transformation steps need to be performed as depicted in Fig. 3. At first, humans and their body parts are detected using the RGB image of the camera. For this purpose, the OpenPose library [5] is employed, which is a real-time multi-person detection system. The library operates on single RGB images and outputs the estimated locations of humans and their body parts including face and fingers in 2D image space. For the detection of a pointing gesture, the elbow $p_{el, i}$ and wrist $p_{wr, i}$ locations are of particular interest.²

² The index $i \in \{l, r\}$ indicates whether the body part belongs to the left or right body side.

Moreover, confidences $c_{el, i} \in [0, 1]$ and $c_{wr, i} \in [0, 1]$ are calculated by the OpenPose library to indicate the confidence of the detection. The confidence indicates the probability that a pixel in the image is a certain joint. A forearm is preferred over a finger for pointing gestures because the detection of hand keypoints (1) takes additional computational power and (2) is less accurate in low-resolution images. In order to obtain a 3D position in space $P = (X, Y, Z)^{T}$ from a 2D image coordinate $p = (x, y)^{T}$, the corresponding depth value $Z$ at the position $p$ is considered and the inverse central projection $π^{-1}(p, Z)$ of the pinhole camera model is employed:

$$P = π^{-1}(p, Z) = \left(\frac{x - c_{x}}{f_{x}} Z, \frac{y - c_{y}}{f_{y}} Z, Z\right)^{T} \tag{1}$$

The parameters $f_{x}$ and $f_{y}$ are the focal lengths in pixels, and $(c_{x}, c_{y})$ is the principal point in image coordinates. These specific parameters have been determined beforehand during calibration of the RGB camera. Using the transformed 2D body part locations, a 3D pose of a human is obtained. This is used to extract both forearms as lines in 3D space:

$$\begin{aligned} \vec{x}_{arm, i} &= \vec{OP}_{el, i} + d \cdot \vec{ew}_{i}, \quad d \in \mathbb{R} \\ \vec{ew}_{i} &= \vec{OP}_{wr, i} - \vec{OP}_{el, i} \end{aligned} \tag{2}$$

A direction vector between elbow and wrist is represented by $\vec{ew}_{i}$, and its length is controlled by the variable $d$. A position vector of a point $P$ is denoted as $\vec{OP}$. Afterwards, it is determined if the user wants to point at a location or not. In order to solve this ambiguity, a “key gesture spotting” mechanism is employed. In case that the user wants to point at a location, he or she needs to raise one of his or her forearms in an upright direction with respect to the ground plane while he or she points at the location simultaneously with the other forearm. The ground plane is expressed as a set of position vectors $\vec{x}$ fulfilling the following equation:

$$(\vec{x} - \vec{u}) \cdot \vec{n} = 0 \tag{3}$$

where $\vec{u}$ is an arbitrary point on the plane and $\vec{n}$ is a normal vector of the plane. The angle $α$ between forearm and ground plane is calculated as follows:

$$\sin(α) = \frac{|\vec{ew}_{i} \cdot \vec{n}|}{|\vec{ew}_{i}| \cdot |\vec{n}|} \tag{4}$$

If $α$ exceeds a certain threshold, one of the user’s forearms is assumed to be upright with respect to the ground plane. In the following, it is assumed that the left forearm is upright, and the right forearm is used for the pointing gesture as depicted in Fig. 3. This is done for reasons of clarity and is not a limitation of this work because both interaction opportunities are implemented. Thus, the user can point with the left or right forearm. In case of a left upright forearm, the intersection point $P_{I}$ between the right forearm and the ground plane is calculated using:

$$\begin{aligned} \vec{OP}_{I} &= \vec{OP}_{el, r} + d \cdot \vec{ew}_{r} \\ d &= \frac{(\vec{u} - \vec{OP}_{el, r}) \cdot \vec{n}}{\vec{ew}_{r} \cdot \vec{n}} \end{aligned} \tag{5}$$

The resulting pointing position $P_{I}$ and certain human pose characteristics are used in the next step to calculate a ROI.
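The geometric steps of Equations (1), (4) and (5) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the upright threshold of 60° (the text only speaks of "a certain threshold") and the rule that rejects intersections behind the elbow are assumptions made for illustration. All points are assumed to be expressed in the same coordinate frame as the ground plane.

```python
import math

def back_project(p, depth, fx, fy, cx, cy):
    """Equation (1): pixel p = (x, y) with depth Z -> 3D point in the camera frame."""
    x, y = p
    return ((x - cx) / fx * depth, (y - cy) / fy * depth, depth)

def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def forearm_angle(elbow, wrist, n):
    """Equation (4): angle in radians between the forearm and the plane with normal n."""
    ew = sub(wrist, elbow)
    s = abs(dot(ew, n)) / (math.sqrt(dot(ew, ew)) * math.sqrt(dot(n, n)))
    return math.asin(min(1.0, s))

def pointing_position(elbow, wrist, u, n):
    """Equation (5): intersection of the elbow-wrist line with the plane (u, n)."""
    ew = sub(wrist, elbow)
    denom = dot(ew, n)
    if abs(denom) < 1e-9:          # forearm parallel to the ground plane
        return None
    d = dot(sub(u, elbow), n) / denom
    if d <= 0:                     # assumed rule: ignore intersections behind the elbow
        return None
    return tuple(e + d * w for e, w in zip(elbow, ew))

def detect_pointing(left, right, u, n, upright_deg=60.0):
    """Key gesture spotting: one forearm must be upright, the other one points.

    left and right are (elbow, wrist) pairs; upright_deg is a hypothetical threshold.
    """
    threshold = math.radians(upright_deg)
    if forearm_angle(left[0], left[1], n) > threshold:
        return pointing_position(right[0], right[1], u, n)
    if forearm_angle(right[0], right[1], n) > threshold:
        return pointing_position(left[0], left[1], u, n)
    return None
```

With an elbow at $(0, 0, 1.2)$, a wrist at $(0.15, 0.3, 1.1)$ and the ground plane $z = 0$, Equation (5) gives $d = 12$ and the pointing position $(1.8, 3.6, 0)$, provided that the other forearm is detected as upright.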

3.2. Probabilistic region of interest

The definition of the ROI for object localization and the selection of controllable lights is obtained using a probabilistic approach to deal with the uncertainty of the pointing gesture. The approach is based on a Gaussian function where the parameters are calculated based on the pointing position and human pose characteristics. The basis of the Gaussian function in the two-dimensional domain is defined as:

$$g(x, y) = A \cdot e^{-\left(a (x - x_{0})^{2} + 2 b (x - x_{0}) (y - y_{0}) + c (y - y_{0})^{2}\right)} \tag{6}$$

In order to consider the direction of the pointing forearm, the rotation $θ$ of the Gaussian function must also be taken into account. $θ$ is calculated from the x and y components of the direction vector $\vec{ew}$ between elbow and wrist:³

³ From now on, the direction vector of the pointing forearm is denoted independent of the body side.

$$θ = \operatorname{arctan2}(\vec{ew}_{y}, \vec{ew}_{x}) \tag{7}$$

It follows for $a$, $b$, $c$:

$$\begin{aligned} a &= \frac{\cos^{2}(θ)}{2 σ_{x}^{2}} + \frac{\sin^{2}(θ)}{2 σ_{y}^{2}} \\ b &= -\frac{\sin(2 θ)}{4 σ_{x}^{2}} + \frac{\sin(2 θ)}{4 σ_{y}^{2}} \\ c &= \frac{\sin^{2}(θ)}{2 σ_{x}^{2}} + \frac{\cos^{2}(θ)}{2 σ_{y}^{2}} \end{aligned} \tag{8}$$

The standard deviation for the x-axis $σ_{x}$ with respect to the pointing gesture depends on the confidence of the detected body parts, i.e., the confidence of the elbow $c_{el}$ and wrist $c_{wr}$. For example, the lower the confidence for the body parts, the greater the variance along the x-axis. The standard deviation along the y-axis $σ_{y}$ depends on the distance between wrist and depth sensor $d_{wd}$ because a large distance entails a high uncertainty. Additionally, both standard deviations are affected by the distance between the wrist and the pointing position $d_{wp}$. All these factors are taken into account by calculating the standard deviations as follows:

$$\begin{aligned} σ_{x} &= \frac{λ \cdot d_{wp}}{c_{el} \cdot c_{wr}} \\ σ_{y} &= λ \cdot d_{wp} \cdot d_{wd} \end{aligned} \tag{9}$$

An additional scaling parameter $λ$ is introduced to control the width of the Gaussian function. While this parameter depends on the object size and the pointing gesture drift, the exact relation has not been examined yet. For the experimental evaluation, its value was determined exploratively. To this end, the parameter was tested within the interval $]0, 10]$ until the width of the Gaussian function fit the experimental setup best. The amplitude coefficient $A$ is selected to normalize the volume under the Gaussian function to $V = 1$ in order to obtain a probability density function:

$$A = (2 π σ_{x} σ_{y})^{-1} \tag{10}$$

Finally, the Gaussian function culminates at the pointing position $P_{I}$. Thus, $x_{0}$ and $y_{0}$ are set accordingly.

An example of the approach is depicted in Fig. 4. The black rectangle illustrates the forearm and the red line shows the virtual pointing line in extension to the forearm. The upper graph illustrates the three-dimensional representation of the gesture and its corresponding probability distribution in the two-dimensional domain. The higher the probability value of a position, the more likely the position is the location pointed at by the user. The illustration assumes an exemplary position of the elbow at $(0, 0, 1.2)$ with a confidence of $c_{el} = 0.9$ and the position of the wrist at $(0.15, 0.3, 1.1)$ with a confidence of $c_{wr} = 0.4$.
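As a rough illustration of Equations (6) to (10), the following Python sketch builds the Gaussian parameters from a pointing gesture and evaluates the function. The names `roi_parameters` and `g` are hypothetical, and the confidences are assumed to be strictly positive because Equation (9) divides by them. The values used below for $λ$ and $d_{wd}$ are not given in the text and are chosen arbitrarily.

```python
import math

def roi_parameters(elbow, wrist, pointing_pos, c_el, c_wr, d_wd, lam):
    """Parameters of g(x, y) following Equations (7) to (10).

    elbow, wrist, pointing_pos are 3D points; c_el and c_wr must be > 0.
    """
    theta = math.atan2(wrist[1] - elbow[1], wrist[0] - elbow[0])   # Eq. (7)
    d_wp = math.dist(wrist, pointing_pos)     # distance wrist -> pointing position
    sigma_x = lam * d_wp / (c_el * c_wr)                            # Eq. (9)
    sigma_y = lam * d_wp * d_wd
    sx2, sy2 = sigma_x ** 2, sigma_y ** 2
    a = math.cos(theta) ** 2 / (2 * sx2) + math.sin(theta) ** 2 / (2 * sy2)   # Eq. (8)
    b = -math.sin(2 * theta) / (4 * sx2) + math.sin(2 * theta) / (4 * sy2)
    c = math.sin(theta) ** 2 / (2 * sx2) + math.cos(theta) ** 2 / (2 * sy2)
    A = 1.0 / (2 * math.pi * sigma_x * sigma_y)                     # Eq. (10)
    return {"A": A, "a": a, "b": b, "c": c,
            "x0": pointing_pos[0], "y0": pointing_pos[1],
            "sigma_x": sigma_x, "sigma_y": sigma_y, "theta": theta}

def g(params, x, y):
    """Equation (6): value of the rotated Gaussian at ground position (x, y)."""
    dx, dy = x - params["x0"], y - params["y0"]
    q = params["a"] * dx * dx + 2 * params["b"] * dx * dy + params["c"] * dy * dy
    return params["A"] * math.exp(-q)

# Example with the elbow and wrist values of the illustration; lam and d_wd are assumed.
params = roi_parameters((0.0, 0.0, 1.2), (0.15, 0.3, 1.1), (1.8, 3.6, 0.0),
                        c_el=0.9, c_wr=0.4, d_wd=2.0, lam=0.1)
peak = g(params, 1.8, 3.6)   # equals the amplitude A at the pointing position
```

For these values, the distance between wrist and pointing position is $d_{wp} = 3.85$, so the assumed $λ = 0.1$ and $d_{wd} = 2.0$ give $σ_{y} = 0.77$.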

Fig. 4.

Colored probabilistic region of interest based on the pointing gesture (red) of the forearm (black). The width of the Gaussian “bell” is different for the axis along the pointing direction resulting in an ellipsoid shape. The center of the Gaussian is the calculated pointing position.

In order to determine the responsibility of a camera or a controllable light for a given ROI, the volume $V(F)$ under the Gaussian function limited by the contour $F$ of the camera’s field of view or the controllable light’s illumination area is calculated. Since the integral for specified boundaries of a Gaussian function cannot be solved analytically, the volume is calculated numerically. Furthermore, the numeric approach also enables the calculation of probabilities for more complex contours that cannot be calculated analytically. For the numerical integration, points within a contour $C$ are spaced evenly: $M(C) = \{c \in \mathbb{R}^{2} \mid Φ(c, C)\}$, where $Φ(c, C)$ describes a function that checks if a point $c$ lies within a contour $C$. The cumulative probability for a certain contour is calculated as follows:

$$V(C) = \sum_{p \in M(C)} g(p) \cdot δ_{x} \cdot δ_{y} \tag{11}$$

where the distances between the evenly spaced points are indicated by $δ_{x}$ and $δ_{y}$. This cumulative probability has to exceed a certain threshold $Θ$ to activate the corresponding camera for object localization or controllable light for illumination.

In case of an activation of a camera, the camera calculates a likelihood for each object $o$ in its field of view. An object $o$ was modeled as a polygon describing the object’s contour similar to a contour of a camera’s field of view. In order to consider the size of an object and prevent unequal treatment in favor of large objects, the mean probability of an object $o$ is calculated as follows:

$$P(o) = \frac{1}{|M(o)|} \sum_{p \in M(o)} g(p) \tag{12}$$

where the sum of probabilities is divided by the number of sample points $|M(o)|$ to normalize smaller and larger objects.
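Equations (11) and (12) can be sketched in a few lines of Python. The sketch assumes that contours are simple polygons given as lists of $(x, y)$ vertices and that $Φ$ is an even-odd ray casting test; the text does not specify either choice. The Gaussian `g` is passed in as a function of $(x, y)$, and the grid spacing of 0.02 is an arbitrary default.

```python
import math

def inside(pt, contour):
    """Phi(c, C): even-odd ray casting test for a point inside a simple polygon."""
    x, y = pt
    hit = False
    for (x1, y1), (x2, y2) in zip(contour, contour[1:] + contour[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            hit = not hit
    return hit

def sample_points(contour, dx, dy):
    """M(C): evenly spaced grid points (cell centers) lying inside the contour."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    nx = math.ceil((max(xs) - min(xs)) / dx)
    ny = math.ceil((max(ys) - min(ys)) / dy)
    points = []
    for i in range(nx):
        for j in range(ny):
            p = (min(xs) + (i + 0.5) * dx, min(ys) + (j + 0.5) * dy)
            if inside(p, contour):
                points.append(p)
    return points

def volume(g, contour, dx=0.02, dy=0.02):
    """Equation (11): cumulative probability V(C) inside the contour."""
    return sum(g(x, y) for x, y in sample_points(contour, dx, dy)) * dx * dy

def mean_probability(g, contour, dx=0.02, dy=0.02):
    """Equation (12): mean probability P(o) over the sample points of an object."""
    points = sample_points(contour, dx, dy)
    return sum(g(x, y) for x, y in points) / len(points) if points else 0.0

def is_responsible(g, field_of_view, threshold):
    """Activation rule: V(F) must exceed the threshold."""
    return volume(g, field_of_view) > threshold
```

Because Equation (12) divides by the number of sample points, a constant function yields the same mean probability for contours of any size, which is the intended normalization between small and large objects.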

Fig. 5.

Light interaction protocol for the light activation.

3.3. Situational lighting

Because the environmental parameters, e.g., brightness, temperature, etc., are oftentimes not adequate to carry out a specified task, a situation-aware approach is presented that reacts to difficult conditions using the example of brightness and lighting. The approach is agent-oriented and uses the Agent Communication Language (ACL) [7] to guarantee the common representation of message contents as well as a mutual understanding of interaction protocols across the different agents. The network of light agents is a set of homogeneous controllable lights $L = \{l_{1}, l_{2}, \dots, l_{6}\}$ which are equally spaced over the test area. Because it can be assumed that the inhabitants of a smart environment want an intelligent behaviour where lights are only switched on when they are really needed, only the relevant lights are switched on. In the scenario, these are the lights covering the area where the object is expected. The corresponding interaction protocol is shown in Fig. 5. Whenever a pointing gesture is recognized, the camera initializes the light interaction protocol and broadcasts a participation request to the $m = |L|$ lights. The request includes a call to calculate the light-specific responsibility value based on the given ROI, the Gaussian function as well as a threshold $Θ$ that specifies the lower boundary for the responsibility. Subsequently, each light $l \in L$ is supposed to reply either with a refuse or agree message depending on the result of the responsibility calculation. At this point it is possible that $b = n - g$ lights agree, which triggers an activation of all $b$ agreeing lights, where $n$ is the number of replying lights and $g$ is the number of refusing lights. Because it can happen that not every light replies ($n ⩽ m$), the interaction protocol takes this into account by setting a deadline which terminates the request procedure if $t_{response}(l) > t_{deadline}(request)$. Hereafter, each of the $b$ lights that agreed to the request sends an informative message about whether everything could be carried out as desired or whether an error occurred, given that $f ⩽ b$ and $i = b - f$, where $f$ is the number of failing lights and $i$ is the number of succeeding lights.
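The counting logic of the light interaction protocol can be made concrete with a small simulation. This is only a sketch under several assumptions: the class `LightAgent`, the message names (`agree`, `refuse`, `inform-done`, `failure`) and the simulated response times are hypothetical, and the responsibility of each light is assumed to have been computed beforehand with Equation (11). No real agent platform or network communication is involved.

```python
from dataclasses import dataclass

@dataclass
class LightAgent:
    name: str
    responsibility: float      # V(F) of the illumination area, assumed precomputed
    response_time: float       # simulated time until the reply arrives (seconds)
    works: bool = True         # simulated outcome of switching the light on
    on: bool = False

    def handle_request(self, threshold):
        """Reply to the participation request with agree or refuse."""
        return "agree" if self.responsibility > threshold else "refuse"

    def activate(self):
        """Switch on and report the outcome back to the initiator."""
        self.on = self.works
        return "inform-done" if self.works else "failure"

def run_light_protocol(lights, threshold, deadline):
    """Simulate one protocol run and return the counters used in the text."""
    # Replies that arrive after the deadline are not considered (n <= m).
    replies = {l.name: l.handle_request(threshold)
               for l in lights if l.response_time <= deadline}
    agreeing = [l for l in lights if replies.get(l.name) == "agree"]
    results = {l.name: l.activate() for l in agreeing}
    n = len(replies)
    g = sum(1 for r in replies.values() if r == "refuse")
    f = sum(1 for r in results.values() if r == "failure")
    b = n - g
    return {"m": len(lights), "n": n, "g": g, "b": b, "f": f, "i": b - f,
            "activated": [name for name, r in results.items() if r == "inform-done"]}

lights = [
    LightAgent("l1", 0.40, 0.1),
    LightAgent("l2", 0.30, 0.2),
    LightAgent("l3", 0.05, 0.1),
    LightAgent("l4", 0.02, 0.1),
    LightAgent("l5", 0.25, 2.0),               # replies too late
    LightAgent("l6", 0.20, 0.1, works=False),  # agrees but fails to switch on
]
summary = run_light_protocol(lights, threshold=0.1, deadline=1.0)
```

In this run, five of the six lights reply in time, two of them refuse, three agree, and one of the agreeing lights reports a failure, so two lights end up switched on.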

3.4. Distributed object localization

In order to refine the gesture-based object localization, the system relies on a distributed object localization mechanism that is also founded on an agent-oriented approach as introduced in Section 3.3. The network of smart camera agents is a set of heterogeneous cameras $C = \{c_{1}, c_{2}, \dots, c_{n}\}$ that consists of a subset of RGB-D cameras denoted as $D = \{d_{1}, d_{2}, \dots, d_{m}\}$ and a subset of RGB cameras denoted as $R = \{r_{1}, r_{2}, \dots, r_{k}\}$ such that $C = D \cup R$ and $D \cap R = \emptyset$. The RGB-D cameras are used for the pointing gesture detection task as described in Section 3.1 followed by the determination of the probabilistic ROI as described in Section 3.2. Because the ROI is only an approximate position of the actual object, the RGB cameras of the intelligent environment are used to precisely localize the object that is likely to be pointed at by the user. To this end, the approach uses a two-stage interaction protocol as shown in Fig. 6. Whenever a pointing gesture is recognized by an RGB-D camera $d \in D$, the camera acts as an initiator of the protocol by broadcasting a participation request to the $m = |C|$ cameras of the system. The request consists of an invitation to determine the responsibility for the given ROI, a tuple of the parameters of the Gaussian function as well as a threshold $Θ$ that specifies the lower boundary for the ROI. Following the invitation, every RGB camera $r \in R$ calculates its responsibility for the lower bounded probabilistic ROI based on the parameterized Gaussian function and the a priori specified field of view of the camera (see Equation (11)). Because the system is designed as a loosely-coupled network, the response of all $m$ participants cannot be guaranteed. On this account, the request procedure is temporally bounded by a deadline $t_{deadline}(request)$.
The deadline guarantees that every reply of a participant $r \in R$ is ignored when $t_{response}(r) > t_{deadline}(request)$. If the constraint for the camera selection $V(F) > Θ$ holds, where $F$ is the field of view of the camera $r \in R$ specified as a two-dimensional contour, the corresponding participant $r \in R$ agrees to the request of the initiator. If the constraint fails or an error occurs, the corresponding participant refuses the request. The subset of cameras $R_{accept} \subseteq R$, where $b = |R_{accept}|$, that agrees to the invitation of the initiating RGB-D camera $d \in D$ now receives a CFP (Call for Proposal). The CFP initializes the second stage of the protocol, which is designed as an auction-based approach, i.e., the camera assignment is carried out by a bidding process inspired by auctions. Within this stage, every RGB camera $r \in R_{accept}$ detects and localizes all identifiable objects $O = \{o_{1}, o_{2}, \dots, o_{k}\}$ within its field of view and calculates the mean probability $P(o_{i})$ for each object (see Equation (12)). The detailed object detection and localization is described in Section 3.5. As a result, a contour of each object found on the ground is obtained. Following the object localization and probability calculation, the most likely object within the field of view of each RGB camera $r \in R_{accept}$ is proposed to the initiating RGB-D camera $d \in D$ as part of the bidding process. The subset of proposing participants is denoted as $R_{propose}$. If no object can be detected and localized, the participant refuses the CFP, thus $R_{propose} ⊊ R_{accept}$. In both cases, an answer is only considered if $t_{response}(r) < t_{deadline}(cfp)$, where $t_{response}(r)$ is the point in time of the response and $t_{deadline}(cfp)$ is a deadline for the CFP.
Afterwards, the initiating RGB-D camera $d \in D$ selects the participant $r_{w} \in R_{propose}$ that proposed the maximum global probability. Subsequently, the initiator accepts the proposal by sending an accept-proposal message to the related participant with the request to send the proposed object. Consequently, the proposals of the other u participants in $R_{propose}$ are rejected. Subsequently, the selected camera $r_{w} \in R_{propose}$ informs the initiator by sending the contour of the object specified as a three-dimensional polygon and finishes the procedure.
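The two stages of the camera interaction protocol can be summarized in a short simulation. It is a simplified sketch, not the agent implementation used in the system: the classes `CameraAgent` and `Proposal`, the simulated delays and the precomputed values for $V(F)$ and $P(o)$ are assumptions, and message passing is replaced by direct method calls.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    camera: str
    object_id: str
    probability: float

@dataclass
class CameraAgent:
    name: str
    responsibility: float                         # V(F) from Eq. (11), assumed precomputed
    objects: dict = field(default_factory=dict)   # object id -> P(o) from Eq. (12)
    request_delay: float = 0.0                    # simulated reply time, stage one
    cfp_delay: float = 0.0                        # simulated reply time, stage two

    def on_request(self, threshold):
        """Stage one: agree only if V(F) exceeds the threshold."""
        return self.responsibility > threshold

    def on_cfp(self):
        """Stage two: propose the most likely object, or refuse with None."""
        if not self.objects:
            return None
        best = max(self.objects, key=self.objects.get)
        return Proposal(self.name, best, self.objects[best])

def localize(cameras, threshold, request_deadline, cfp_deadline):
    """Return the winning proposal and the names of the rejected proposers."""
    accepted = [c for c in cameras
                if c.request_delay <= request_deadline and c.on_request(threshold)]
    proposals = []
    for c in accepted:
        if c.cfp_delay < cfp_deadline:        # late answers are not considered
            proposal = c.on_cfp()
            if proposal is not None:
                proposals.append(proposal)
    if not proposals:
        return None, []
    winner = max(proposals, key=lambda p: p.probability)
    rejected = [p.camera for p in proposals if p is not winner]
    return winner, rejected

cameras = [
    CameraAgent("r1", 0.60, {"shoe": 0.80, "cup": 0.30}),
    CameraAgent("r2", 0.50, {"shoe": 0.90}),
    CameraAgent("r3", 0.05, {"cup": 0.99}),               # not responsible for the ROI
    CameraAgent("r4", 0.40, {}),                          # sees no object, refuses the CFP
    CameraAgent("r5", 0.70, {"cup": 0.95}, cfp_delay=5.0) # proposal arrives too late
]
winner, rejected = localize(cameras, threshold=0.1, request_deadline=1.0, cfp_deadline=1.0)
```

Here `r2` wins with its shoe proposal and the proposal of `r1` is rejected, while `r3`, `r4` and `r5` drop out in the first stage, by refusing the CFP and by missing the deadline, respectively.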

Fig. 6.

Camera interaction protocol for the distributed object localization.

3.5. Object detection and localization

In this work, the detection of visual markers as objects is replaced with a more advanced object detection module. The system can currently detect two different object types: (1) a shoe and (2) a cup. These types of objects were chosen because they are small enough to be carried by a low-cost mobile robot and they are realistic objects for a tidying-up scenario. The detection of the objects is performed in two stages. In the first stage, each RGB camera integrated into the environment searches for objects in its field of view whenever the camera is requested to detect objects (see Section 3.4). This detection is based on local color changes in the image and morphological properties of detected segments. In order to determine the 3D position in world space for each detected object, the contour of the object in image space is projected onto the ground plane. This is possible because the objects are assumed to be on the ground plane. The probability for a detected object is calculated according to Equation (12).
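For a ceiling-mounted camera that looks straight down, the projection of an image contour onto the ground plane reduces to Equation (1) with the depth set to the mounting height. The following Python sketch shows this special case only. It assumes an ideal pinhole camera without lens distortion whose image x-axis is aligned with the world x-axis and whose image y-axis points in the negative world y-direction; the intrinsic parameters in the example are hypothetical, and a general mounting pose would require the full extrinsic transformation.

```python
def pixel_to_ground(p, fx, fy, cx, cy, cam_pos):
    """Project pixel p = (x, y) of a downward-looking camera onto the plane z = 0.

    cam_pos = (X, Y, h) is the camera position in world coordinates. The axis
    alignment described above is an assumption of this sketch.
    """
    x, y = p
    cam_x, cam_y, h = cam_pos
    xc = (x - cx) / fx * h      # Equation (1) with Z = h
    yc = (y - cy) / fy * h
    return (cam_x + xc, cam_y - yc, 0.0)

def contour_to_ground(contour_px, fx, fy, cx, cy, cam_pos):
    """Project every vertex of an object contour from image space to the ground."""
    return [pixel_to_ground(p, fx, fy, cx, cy, cam_pos) for p in contour_px]

# Hypothetical intrinsics for a 1920 x 1080 image and a camera mounted at 2.95 m.
ground_contour = contour_to_ground(
    [(960, 540), (1960, 540), (960, 1540)],
    fx=1000.0, fy=1000.0, cx=960.0, cy=540.0, cam_pos=(1.0, 2.0, 2.95))
```

The principal point maps to the position directly below the camera, and a pixel offset of one focal length corresponds to a ground offset equal to the mounting height.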

The second stage comprises the local detection and recognition of objects on the mobile robot when the robot approaches an object. For this purpose, the mobile robot’s on-board RGB-D camera is used. In addition to the color information used to recognize an object, the depth image is employed to accurately determine the position of the object for grasping.

3.6. Object grasping robot

The approach involves cooperation between robots and an intelligent environment. In the example application, a mobile robot with a gripping system is used to pick up the found objects. Similar to the previous approach [24], the robot first uses map-based navigation to reach the target position determined by the camera agents. Once the object has been detected by the robot’s on-board camera, the navigation switches to a visual-servoing approach. This is used to position the robot appropriately in front of the object. Based on optical and geometric data from an RGB-D camera, the object is identified and a grasping variant is selected. The robot then approaches the object until it is within reach of the manipulator. Due to the short range of the robot arm, the object is outside the robot’s field of vision during the grasp. The robot lifts the object and centers the arm in front of the camera to check whether it has successfully picked up the object. If the grasp fails, the robot performs a recovery behavior by moving back, relocalizing the object and performing another gripping attempt. This behavior is performed three times before the robot aborts the task. Once the robot has successfully gripped the object, it moves its arm into a transport pose in which the object does not hang in front of the camera, as the camera is required for localization. Finally, the object is transported to a user-defined position depending on the object type and deposited. The positioning is negligible for the evaluation.
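The grasp-and-verify loop with its recovery behavior can be sketched as a small control function. The callbacks `attempt_grasp`, `verify_grasp` and `recover` are hypothetical placeholders for the robot's actual behaviors, and the sketch reads "three times" as a limit of three grasp attempts in total, which is one possible interpretation of the description.

```python
def grasp_with_recovery(attempt_grasp, verify_grasp, recover, max_attempts=3):
    """Try to grasp an object, verify the result and recover after a failure.

    Returns (success, attempts_used). All three callbacks are hypothetical:
    attempt_grasp() approaches and grasps, verify_grasp() checks the arm in
    front of the camera, recover() moves back and relocalizes the object.
    """
    for attempt in range(1, max_attempts + 1):
        attempt_grasp()
        if verify_grasp():
            return True, attempt
        if attempt < max_attempts:
            recover()          # no recovery after the final failed attempt
    return False, max_attempts

# Simulated run: the first two grasps fail, the third one succeeds.
log = []
outcomes = iter([False, False, True])
result = grasp_with_recovery(lambda: log.append("grasp"),
                             lambda: next(outcomes),
                             lambda: log.append("recover"))
```

Keeping the limit as a parameter makes it easy to adapt the sketch if the recovery behavior itself, rather than the grasp attempt, is meant to be repeated three times.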

4. Evaluation of the multi-stage object localization

The multi-stage object localization system based on human gestures was evaluated with respect to four criteria to assess the appropriateness of the approach. The first criterion is the drift of the pointing gesture, which indicates how accurately the system can detect pointing gestures with respect to a given target position. Second, the success rate of the object selection based on the probabilistic ROI was assessed. As the third criterion, the drift of the object position was evaluated; this position is determined in the last step of the multi-stage object localization, in which RGB cameras mounted in the environment localize the selected object. Finally, the pointing gesture detection time was measured.

4.1. Experimental setup

Fig. 7.

Experimental setup describing camera, person and target positions.

The experimental setup used for the evaluation is depicted in Fig. 7 as a top view of the environment. At the origin of the coordinate system, an RGB-D camera is mounted at a height of 1.79 m with a pitch angle of $26^{\circ}$ to detect human pointing gestures. An Asus Xtion was used as the RGB-D camera with a color image resolution of $640 \times 480$ pixels and a depth image resolution of $160 \times 120$ pixels. Pointing gesture detection was performed on a common desktop PC running at 3.4 GHz (Intel(R) Core(TM) i7-3770 CPU, 16 GB RAM, GeForce GT 630 GPU). In addition, two of the three RGB cameras mounted on the ceiling (2.95 m height, pitch angle of $90^{\circ}$ ) were used for object localization. Their fields of view are visualized in green and blue. The overlap between both fields of view is shown in yellow. The RGB cameras are Microsoft LifeCams with an image resolution of $1920 \times 1080$ pixels. All RGB sensors are calibrated, i.e., their intrinsic camera parameters are known and their relative transformations with respect to the world frame are determined. Finally, different visual markers [8] were placed as objects at all target positions $T 1$ – $T 12$ on the ground to indicate them to the participants. Visual markers were chosen in this evaluation because they can be robustly detected and the evaluation criteria are independent of the object type. However, real objects were selected for the evaluation of the tidying-up system as described in Section 5.

Fig. 8.

Drift of the pointing gesture depending on the person position (from left to right: $P 1$ to $P 3$ ).

4.2. Experimental procedure

A participant performed pointing gestures from three different person positions ( $P 1$ – $P 3$ ), each at a different distance from the RGB-D camera (1.5 m–2.5 m, in 0.5 m steps). The participant was asked to point towards the target positions $T 1$ – $T 12$ sequentially from each person position $P 1$ – $P 3$ . The time between the start of the pointing gesture and the detection of the gesture was measured. The actual position pointed at by the user is referred to as the pointing position. Afterwards, the system calculated a probabilistic ROI and selected an object based on its probability. In the next step, the system localized the object using the real-time ArUco library [8]. This position, which is determined by one of the RGB cameras, is referred to as the object position. Each combination of a person position and a target position was evaluated 10 times per participant. Thus, each trial with a participant provided $3 \times 12 \times 10 = 360$ pointing and object positions. The experiment was performed with 3 participants separately because the approach is prone to errors when people overlap in image space. Since the difference between the positions is essential for understanding the evaluation, they are summarized briefly:

Person position: This is the position of the human, and in the experimental setup there are three person positions with different distances to the RGB-D camera denoted as $P 1$ – $P 3$ .

Target position: This is the position to be pointed at by a participant, and in the experimental setup there are twelve target positions denoted as $T 1$ – $T 12$ .

Pointing position: This is the actual position pointed at by a participant. If pointing and target position are close to each other, this results in a small pointing gesture drift.

Object position: This is the position of the object pointed at by the user calculated in the distributed object localization step. If object and target position are close to each other, this results in a small object localization drift.

Fig. 9.

Discrepancy between elbow-wrist and wrist-palm vector.

Fig. 10.

Success rate of the object selection depending on the person position (from left to right: $P 1$ to $P 3$ ).

4.3. Pointing gesture drift

In order to assess the pointing gesture drift, the Euclidean distance between the pointing and target positions is considered. Figure 8 visualizes the drift of the pointing gesture depending on the three person positions. The smallest drift is reached from person position $P 2$ with a mean distance of $M = 0.55 m$ between the pointing position and the corresponding target position. The mean values for the other person positions are slightly worse ( $P 1$ : $M = 0.64 m$ , $P 3$ : $M = 0.61 m$ ). It is also apparent that the drift increases the farther the target position is from the person position. While pointing gestures are accurate for nearby target positions, e.g., 0.23 m drift for person position $P 1$ and target position $T 7$ , they can differ significantly for target positions more than 3 m away from the person position, e.g., 0.91 m drift for person position $P 1$ and target position $T 9$ . One main reason for this effect was observed: participants incorrectly estimated the direction of their pointing forearm as visualized in Fig. 9. Thus, they thought they were pointing at the right target position (green dotted line) but actually pointed at a different position (red solid line). This has a strong effect on the drift when target objects are farther away because even a minimal error of the pointing angle then results in a larger positional error. This leads to an increasing drift with an increasing distance between person and target position. The effect is reduced by the subsequent steps of the multi-stage approach, which are discussed in the next subsection. Overall, a mean pointing gesture drift of $M = 0.60 m$ can be reported considering all person positions.
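The drift measure and the distance effect discussed above can be sketched in a few lines. The function names are chosen for illustration, and `lateral_offset` is a deliberately simplified model (offset = distance × tan of the angular error) that is assumed here only to show the trend; it is not the evaluation code used for Fig. 8.

```python
import math


def drift(position, target):
    """Euclidean distance in metres between two 2D ground positions."""
    return math.hypot(position[0] - target[0], position[1] - target[1])


def mean_drift(positions, target):
    """Mean drift of repeated pointing positions for one target position."""
    return sum(drift(p, target) for p in positions) / len(positions)


def lateral_offset(distance_m, angle_error_deg):
    """Ground offset caused by an angular pointing error (simplified model)."""
    return distance_m * math.tan(math.radians(angle_error_deg))
```

Under this simplified model, a hypothetical angular error of 10° produces an offset of about 0.18 m at a distance of 1 m and about 0.53 m at 3 m, which matches the observed tendency that the drift grows with the distance between person and target position.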

4.4. Object selection success rate

This subsection evaluates the success of the object selection based on the probabilistic ROI. For example, if the participant points at target position $T 4$ , the system should consider the object at this position as the object pointed at by the user and not an object at another target position. Thus, the success rate of the object selection is reported depending on the person position as shown in Fig. 10. In general, these results can be explained by the accuracy of the pointing gesture. The smaller the pointing gesture drift, the higher the success rate of selecting the right object because the probabilistic ROI peaks at the pointing position. Thus, the area around the pointing position has a high probability for the occurrence of an object. Therefore, nearby target positions in particular have a high success rate, e.g., target position $T 1$ ( $P 1$ : 100%, $P 2$ : 93.3%, $P 3$ : 96.7%). In contrast, the success rate of the object selection decreases dramatically for target positions more than 3 m away, e.g., target position $T 6$ ( $P 1$ : 30.0%, $P 2$ : 30.0%, $P 3$ : 13.3%). It is also apparent that the object selection success rates for target positions $T 3$ and $T 12$ are often high although the pointing gesture accuracy is low for these target positions. This is an advantage of the probabilistic ROI, which assigns adequate probabilities to these target positions. The overall success rate for the object selection is 64.7%.
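The idea of selecting the object with the highest probability around the pointing position can be illustrated as follows. Note that this sketch uses an isotropic Gaussian as a stand-in: the Gaussian shape, the parameter `sigma`, and the function name are assumptions for illustration and are neither the probabilistic ROI defined earlier in the paper nor the object probability of Equation (12).

```python
import math


def select_object(pointing_pos, candidates, sigma=0.5):
    """Return (best_candidate, score) under an isotropic Gaussian ROI.

    The Gaussian and sigma (in metres) are illustrative assumptions, not the
    probabilistic ROI used by the described system.
    """
    def score(candidate):
        d2 = ((candidate[0] - pointing_pos[0]) ** 2
              + (candidate[1] - pointing_pos[1]) ** 2)
        return math.exp(-d2 / (2.0 * sigma ** 2))

    best = max(candidates, key=score)
    return best, score(best)
```

With an isotropic model the selection reduces to choosing the nearest candidate; a differently shaped ROI, such as the one defined earlier in the paper, may rank candidates differently.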

4.5. Distributed object localization drift

The final step of the multi-stage object localization system calculates the position of the object pointed at by the user to refine the initial pointing position. Therefore, the Euclidean distance between object and target position was considered to assess the drift of the distributed object localization step. Since this step is independent of the person position, the mean results for all three person positions are visualized in Fig. 11. This data also contains the 35.3% of objects that were selected incorrectly. The drift ranges between 0.02 m for target position $T 3$ and $0.42 m$ for target position $T 2$ . This drift depends on the object selection success rate as well as the correct calibration of the RGB cameras. The higher the success rate of the object selection and the more accurate the camera calibration, the smaller the drift of the distributed object localization. The mean drift for all target positions is $M = 0.22 m$ .

Fig. 11.

Drift of the distributed object localization. Note that a different color scaling compared to the pointing gesture drift results shown in Fig. 8 was used.

4.6. Pointing gesture detection time

During this experiment, the time between the start of a pointing gesture (raising one forearm) and the detection of the gesture by the system was measured. This time highly depends on the hardware configuration running the pointing gesture detection. The pointing gesture detection time for the hardware configuration described in Section 4.1 is visualized in Fig. 12. The boxplots reveal that there are no significant differences in the detection times for the different person positions. Thus, the detection time is independent of the distance between a human and the RGB-D camera. The mean detection time for all person positions is 4.0 seconds. Another characteristic is that the interquartile range is 0.7 seconds on average ( $P 1$ and $P 2$ : 0.7 seconds, $P 3$ : 0.8 seconds), which underlines the robustness of the pointing gesture detection approach. However, there are also some outliers with a detection time of more than 5 seconds. These can occur due to inaccurate depth measurements, which result in a delayed recognition of a raised forearm.
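The boxplot statistics mentioned above can be computed as in the following sketch. The function name is hypothetical, and the 1.5 × IQR rule for flagging outliers is an assumed, common boxplot convention; the text does not state which convention was used for Fig. 12.

```python
import statistics


def iqr_summary(times_s):
    """Return (q1, q3, iqr, outliers) for a list of detection times in seconds."""
    q1, _, q3 = statistics.quantiles(times_s, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers.
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [t for t in times_s if t < lower or t > upper]
    return q1, q3, iqr, outliers
```

A small interquartile range relative to the mean indicates that most detection times lie close together, which is the sense in which the text speaks of robustness.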

Fig. 12.

Pointing gesture detection time depending on the person position.

4.7. Discussion

The experimental results show that the system can successfully detect pointing gestures with a mean drift of $M = 0.60 m$ . Furthermore, the distance between the person and target position is a crucial factor for the drift. Pointing gestures at targets up to approximately 2 m away can be detected accurately ( $M = 0.39 m$ ). For larger distances, the drift increases ( $M = 1.02 m$ ), which leads to inaccuracy in determining the correct object. However, the probabilistic modeling of a ROI around the pointing position allows the selection of correct objects by employing a distributed object localization system. Finally, the initial pointing position is refined by the object position calculated by the RGB cameras. The mean drift for the distributed object localization is $M = 0.22 m$ . Thus, the multi-stage object localization approach refines the initial pointing gesture drift of $M = 0.60 m$ to $M = 0.22 m$ on average. In order to detect pointing gestures, the system needs approximately 4.0 seconds, which is a reasonable time. However, this value highly depends on the hardware configuration, and currently the computation time is 1.2 seconds to perform pointing gesture detection on a single image. Consequently, the pointing gesture must be held for at least 1.2 seconds for a gesture to be recognized. If this time can be reduced, e.g., by upgrading the hardware configuration with a more powerful GPU, the pointing gesture detection time can be reduced as well.
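As a worked check of the refinement reported above, the relative reduction of the mean drift follows directly from the two reported mean values; no additional data is assumed:

$$\frac{0.60\,\mathrm{m} - 0.22\,\mathrm{m}}{0.60\,\mathrm{m}} = \frac{0.38}{0.60} \approx 0.63$$

That is, the multi-stage approach reduces the mean drift by roughly 63% relative to the initial pointing gesture drift.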

5. Evaluation of the tidying-up system

In a second experiment, the success rate of the tidying-up system was determined. The first criterion here was the performance of the overall system, i.e., how often the application ran successfully and which errors occurred. The second criterion was the influence of the support of the lighting agents on the success rate of the system.

5.1. Experimental setup

Fig. 13.

Setup showing cameras, lights, robot and objects.

In this evaluation, the experimental setup was extended by further actuators, which are shown in Fig. 13 as a top view of the environment. In addition to the ceiling cameras, six ceiling lamps ( $L 1$ – $L 6$ ) were installed (2.95 m height). Philips Hue LED downlights, which each illuminated a $2.5 m \times 2.5 m$ area on the floor, were used. In addition, the environment was extended by two cupboards in which the carried objects could be placed. Two colored LED strips were installed on each of them, and two additional colored Philips Hue Iris lamps were mounted in the environment. These colored lights give feedback to the user. All lamps communicate via the ZigBee protocol, and each has its own software agent for control and communication. The robot used was a modified Turtlebot 2 with a Lynxmotion AL5D manipulator with four degrees of freedom and a two-finger gripper. An Intel RealSense D435 sensor was installed for local object recognition, navigation and grasp planning. The camera was located below the arm, and the Z-axis of the image plane was aligned parallel to the X-axis of the mobile robot. Finally, the markers from the first experiment were replaced with two different objects: a shoe and a cup. For each of the two objects, a cupboard was defined as the storage location in the system.
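To illustrate how a light agent could decide which lamps are relevant for a given floor position, the following sketch models each lamp's illuminated area as an axis-aligned square of $2.5 m \times 2.5 m$ centered below the lamp. The square model, the function name, and any lamp coordinates passed in are assumptions for illustration; the actual lamp positions are those shown in Fig. 13.

```python
def lamps_covering(target, lamps, side_m=2.5):
    """Return the names of lamps whose square floor footprint contains target.

    lamps maps a lamp name to the (x, y) floor position below the lamp.
    The footprint is modelled as an axis-aligned square of side side_m
    centred below the lamp (an assumption; positions are hypothetical).
    """
    half = side_m / 2.0
    return sorted(
        name
        for name, (lx, ly) in lamps.items()
        if abs(target[0] - lx) <= half and abs(target[1] - ly) <= half
    )
```

A position on the border between two footprints is reported for both lamps, so a light agent built this way would switch on every lamp that contributes light to the requested area.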

5.2. Experimental procedure

In this part of the experiment, the interaction with the user through pointing gestures was omitted because the recorded pointing positions from the first part were reused. The pointing positions which led to incorrectly selected objects in the previous evaluation (see Section 4.4) were also part of this experiment. One real object at a time, i.e., a shoe or a cup, was placed on each target position ( $T 1$ – $T 12$ ). The robot always started from an initial pose, which could be, for example, the charging station in a real-life application. In each attempt, the recorded pointing position was sent to the system, and the final status was stored according to the following categories:

Object Detection Error: The ceiling camera agents could not detect any objects on the surface.

Robot Navigation Error: The robot could not reach a target pose, e.g., the position of the object, or was unable to find an object at the target coordinates.

Robot Object Error: The robot failed three times in a row to grasp the object.

Wrong Object Error: The robot misidentified the object and put it in the wrong cupboard.

The evaluation took place in two different lighting situations. In both situations, i.e., a bright and a dark environment, the experiment was carried out once with the light agent switched on and once with it switched off. This resulted in a total of 192 experiments.
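The bookkeeping of the final status per attempt can be sketched as a small data structure. The error categories are taken from the list above; the `SUCCESS` member, the class name, and the tuple layout of a result are assumptions added for illustration.

```python
from collections import Counter
from enum import Enum


class FinalStatus(Enum):
    SUCCESS = "Success"  # assumed category for attempts without an error
    OBJECT_DETECTION_ERROR = "Object Detection Error"
    ROBOT_NAVIGATION_ERROR = "Robot Navigation Error"
    ROBOT_OBJECT_ERROR = "Robot Object Error"
    WRONG_OBJECT_ERROR = "Wrong Object Error"


def tally(results):
    """Count final statuses per scenario.

    results is an iterable of (environment, light_agent_on, FinalStatus)
    tuples, where environment is e.g. "bright" or "dark" (assumed layout).
    """
    counts = {}
    for environment, light_agent_on, status in results:
        counts.setdefault((environment, light_agent_on), Counter())[status] += 1
    return counts
```

Each scenario is identified by the pair of lighting situation and light agent state, which yields the four scenarios evaluated in this experiment.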

5.3. Results

Fig. 14.

Final status of all tries per scenario and per failure category.

The results of the performance evaluation depending on the scenarios and failure categories are visualized in Fig. 14. They reveal that the success rate in a bright environment is higher (with light system: 79.2%; without light system: 72.9%) than in a dark environment (with light system: 58.3%; without light system: 4.2%). The influence of the light system is small in the bright environment (difference: 6.3 percentage points), but it has a significant impact on the success rate under bad lighting conditions (difference: 54.1 percentage points). This indicates that the situational use of additional actuators of an intelligent environment can help support robotic tasks. In detail, the most significant error under good lighting conditions is the Robot Object Error, which can be attributed to erroneous sensor data occurring in the grasping process. In contrast, the Object Detection Error and the Robot Navigation Error play a major role in dark environments. The reason for this is that the object detection of the ceiling cameras is strongly influenced by noise, with the result that no objects are detected. This noise is caused by darkness and a weak light output of the smart lights. Furthermore, the Robot Navigation Error is the result of the detection of incorrect or non-existent objects leading to wrong robot navigation goals.
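The reported percentages can be checked with a short calculation. The sketch below assumes that the 192 experiments are split evenly over the four scenarios (48 each); the success counts are inferred from the reported percentages under that assumption and are not stated in the text.

```python
# Assumption: 192 experiments split evenly over four scenarios.
TRIALS_PER_SCENARIO = 192 // 4

# Success counts inferred from the reported percentages (assumption).
successes = {
    ("bright", True): 38,
    ("bright", False): 35,
    ("dark", True): 28,
    ("dark", False): 2,
}


def success_rate(count, trials=TRIALS_PER_SCENARIO):
    """Success rate in percent, rounded to one decimal place."""
    return round(100.0 * count / trials, 1)


def light_gain(environment):
    """Difference in percentage points between light agent on and off."""
    with_light = success_rate(successes[(environment, True)])
    without_light = success_rate(successes[(environment, False)])
    return round(with_light - without_light, 1)
```

Under these assumptions the computed rates reproduce the reported values, and the gain from the light system is 6.3 percentage points in the bright environment and 54.1 percentage points in the dark environment.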

In summary, the results show an improvement of the success rate under poor lighting conditions when using a cooperative and situation-aware smart environment.

6. Conclusion & future work

A cooperative multi-agent system for service tasks in smart environments was proposed. As an example, a tidying-up system was implemented due to its high practical relevance. It consists of a mobile robot, several distributed cameras and smart lights. Human pointing gestures were used to select objects, and a multi-stage object localization approach was used to accurately determine the location of these objects. Unlike other approaches in the area of human-robot interaction that only allow gesture-based interaction with single robots, the entire intelligent environment was used as the interaction partner. Hence, the system is no longer limited to the field of view of a single robot because it takes advantage of multiple smart cameras mounted in the intelligent environment. Additionally, smart lights are employed to support the camera-based object detection, e.g., if an area is too dark, and to act as a feedback channel revealing the system’s state through colored light. The evaluation of the system examines different scenarios and shows that the approach can significantly increase the success rate of the tidying-up system under bad lighting conditions.

Future work should focus on reducing the pointing gesture drift for objects more than 3 m away from the user and on recognizing different kinds of objects. The latter requires an advanced recognition module based on semantic segmentation and the integration of further grasping approaches. Additionally, the limitation that the gesture recognition is prone to errors in crowded environments will be addressed. Furthermore, the integration of a speech recognition system could avoid ambiguities concerning the object selection when multiple different objects are located close to each other. Additional actuators and sensors in the environment could help with further applications, e.g., presence detectors to locate the user within an environment or smart cupboards that open themselves when the robot approaches. In order to improve the performance of the system, an optimized recovery behavior is planned for cases in which the robot cannot see the object or cannot grasp it correctly. Finally, the system can be adapted to other service tasks, such as fetching and delivering medicine or food, as well as pointing to areas for vacuum spot cleaning. Due to the agent-based architecture, this can easily be achieved by integrating new skills, agents and task routines.

Footnotes

Acknowledgements

This work is financially supported by the German Federal Ministry of Education and Research (BMBF, Funding number: 03FH006PX5).

References

1. Abdo, Stachniss, Spinello and Burgard, Robot, organize my shelves! Tidying up objects by predicting user preferences, in: IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 1557–1564. doi:10.1109/ICRA.2015.7139396.

2. M.R. Abid, E.M. Petriu and Amjadian, Dynamic sign language recognition for smart home interactive application using stochastic linear formal grammar, IEEE Transactions on Instrumentation and Measurement (TIM) 64(3) (2015), 596–605. doi:10.1109/TIM.2014.2351331.

3. J.C. Augusto, Callaghan, Cook, Kameas and Satoh, Intelligent environments: A manifesto, Human-centric Computing and Information Sciences 3(12) (2013).

4. Bugmann and S.N. Copleston, What can a personal robot do for you? in: Towards Autonomous Robotic Systems, Groß, Alboul, Melhuish, Witkowski, T.J. Prescott and Penders, eds, Springer, Berlin Heidelberg, 2011, pp. 360–371. doi:10.1007/978-3-642-23232-9_32.

5. Cao, Simon, S.-E. Wei and Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

6. Droeschel, Stückler, Holz and Behnke, Towards joint attention for a domestic service robot – Person awareness and gesture recognition using time-of-flight cameras, in: IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 1205–1210.

7. Foundation for intelligent physical agents, 2017, http://www.fipa.org.

8. Garrido-Jurado, Munoz-Salinas, F.J. Madrid-Cuevas and M.J. Marín-Jiménez, Automatic generation and detection of highly reliable fiducial markers under occlusion, Pattern Recognition 47(6) (2014), 2280–2292. doi:10.1016/j.patcog.2014.01.005.

9. Gulzar and Kyrki, See what I mean – Probabilistic optimization of robot pointing gestures, in: IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), 2015, pp. 953–958. doi:10.1109/HUMANOIDS.2015.7363484.

10. Han, Hyun, Jeong, Yoo and J.W. Hong, A smart home control system based on context and human speech, in: International Conference on Advanced Communication Technology (ICACT), 2016, pp. 165–169.

11. Hornung, Böttcher, Schlagenhauf, Dornhege, Hertle and Bennewitz, Mobile manipulation in cluttered environments with humanoids: Integrated perception, task planning, and action execution, in: IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2014, pp. 773–778.

12. J.H. Kim, K.H. Lee, Y.D. Kim, N.S. Kuppuswamy and Jo, Ubiquitous robot: A new paradigm for integrated services, in: IEEE International Conference on Robotics and Automation (ICRA), 2007, pp. 2853–2858.

13. Li and Jarvis, Visual interpretation of natural pointing gestures in 3D space for human–robot interaction, in: International Conference on Control Automation Robotics Vision (ICARCV), 2010, pp. 2513–2518.

14. Meeussen, Wise, Glaser, Chitta, McGann, Mihelich, Marder-Eppstein, Muja, Eruhimov, Foote, Hsu, R.B. Rusu, Marthi, Bradski, Konolige, Gerkey and Berger, Autonomous door opening and plugging in with a personal robot, in: IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 729–736.

15. Nickel and Stiefelhagen, Visual recognition of pointing gestures for human–robot interaction, Image and Vision Computing 25(12) (2007), 1875–1884. doi:10.1016/j.imavis.2005.12.020.

16. N.S.M. Nor and Mizukawa, Robotic services at home: An initialization system based on robots’ information and user preferences in unknown environments, International Journal of Advanced Robotic Systems 11(7) (2014), 112. doi:10.5772/58682.

17. N.S.M. Nor, N.L. Trung, Maeda and Mizukawa, Tracking and detection of pointing gesture in 3D space, in: International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), 2012, pp. 234–235.

18. Pateraki, Baltzakis and Trahanias, Visual estimation of pointed targets for robot guidance via fusion of face pose and hand orientation, Computer Vision and Image Understanding 120 (2014), 1–13. doi:10.1016/j.cviu.2013.12.006.

19. Pörtner, Schröder, Rasch, Sprute, Hoffmann and König, The power of color: A study on the effective use of colored light in human–robot interaction, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 3395–3402.

20. Pyo, Nakashima, Kuwahata, Kurazume, Tsuji, Morooka and Hasegawa, Service robot system with an informationally structured environment, Robotics and Autonomous Systems 74(Part A) (2015), 148–165. doi:10.1016/j.robot.2015.07.010.

21. Rasch, Wachsmuth and König, Understanding movements of hand-over between two persons to improve humanoid robot systems, in: IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2017, pp. 856–861.

22. Saffiotti, Broxvall, Gritti, LeBlanc, Lundh, Rashid, B.S. Seo and Y.J. Cho, The PEIS-ecology project: Vision and results, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008, pp. 2329–2335.

23. Shukla, Erkent and Piater, Probabilistic detection of pointing directions for human–robot interaction, in: International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2015, pp. 1–8.

24. Sprute, Pörtner, Rasch, Battermann and König, Ambient assisted robot object search, in: Enhanced Quality of Life and Smart Living: 15th International Conference on Smart Homes and Health Telematics, 2017, pp. 112–123. doi:10.1007/978-3-319-66188-9_10.

25. Sprute, Rasch, Pörtner, Battermann and König, Gesture-based object localization for robot applications in intelligent environments, in: International Conference on Intelligent Environments (IE), 2018, pp. 48–55.

26. Tölgyessy, Dekan, Duchoň, Rodina, Hubinský and Chovanec, Foundations of visual linear human–robot interaction via pointing gesture navigation, International Journal of Social Robotics 9(4) (2017), 509–523. doi:10.1007/s12369-017-0408-9.

27. Vacher, Guirand, Serignat, Fleury and Noury, Speech recognition in a smart home: Some experiments for telemonitoring, in: Conference on Speech Technology and Human–Computer Dialogue (SpeD), 2009, pp. 1–10.

28. Van den Bergh, Carton, De Nijs, Mitsou, Landsiedel, Kuehnlenz, Wollherr, Van Gool and Buss, Real-time 3D hand gesture interaction with a robot for understanding directions from humans, in: IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2011, pp. 357–362.

29. A.R. Varkonyi-Koczy and Tusor, Human–computer interaction for smart environment applications using fuzzy hand posture and gesture models, IEEE Transactions on Instrumentation and Measurement (TIM) 60(5) (2011), 1505–1514. doi:10.1109/TIM.2011.2108075.

30. Vaussard, Fink, Bauwens, Retornaz, Hamel, Dillenbourg and Mondada, Lessons learned from robotic vacuum cleaners entering the home ecosystem, Robotics and Autonomous Systems 62(3) (2014), 376–391. doi:10.1016/j.robot.2013.09.014.

31. Yamazaki, Ueda, Nozawa, Mori, Maki, Hatao, Okada and Inaba, System integration of a daily assistive robot and its application to tidying and cleaning rooms, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pp. 1365–1371.

32. Yamazaki, Ueda, Nozawa, Mori, Maki, Hatao, Okada and Inaba, Tidying and cleaning rooms using a daily assistive robot – An integrated system for doing chores in the real world, Paladyn, Journal of Behavioral Robotics (2010).

33. Yan, He, Zhang and Zhang, Task execution based-on human–robot communication and pointing gestures, in: Advanced Mechanical Science and Technology for the Industrial Revolution 4.0, Yao, Zhong, Kikuta, J.-G. Juang and Anpo, eds, Springer, Singapore, 2018, pp. 37–46. doi:10.1007/978-981-10-4109-9_5.

34. Zhang, C.J. Yang, S.E. Hudson, Harrison and Sample, Wall++: Room-scale interactive and context-aware sensing, in: CHI Conference on Human Factors in Computing Systems, CHI’18, 2018, pp. 273:1–273:15.

² The index $i \in \{l, r\}$ indicates if the body part is part of the left or right body side.

³ From now on, the direction vector of the pointing forearm is denoted independent of the body side.