Abstract
This paper presents a novel multi-stage perception system for collision avoidance in mobile robots. In the scenario considered here, a mobile robot stands in a workspace with a set of potential targets to reach or interact with. When a human partner appears and gestures toward a target, the robot must plan a collision-free trajectory to reach that goal. To solve this problem, a full perception system composed of parallel convolutional neural networks followed by consecutive processing stages is proposed for generating a collision-free trajectory according to the desired goal. The system is evaluated at each step in real environments and through several performance tests, and proves to be robust and fast enough for real-time applications.
Introduction
In recent years, the field of applied robotics has expanded from factories to other increasingly complex environments, where tasks are less repetitive and require cooperation between humans and machines. Motion planning, which has become a crucial subject in this area, aims to enable robots to automatically compute their movements from different high-level task descriptions acquired through sensors [1].
Since the environment may change at any time, detection algorithms must be robust and fast enough to detect any change involving a modification to previously planned movements. Obstacle detection, a critical stage in navigation tasks, aims to estimate the position and size of every object in the environment that could present a collision risk. In practice, this information is acquired using sensors such as scanners, radars, sonars, or stereo vision cameras [2].
Related works
Deep learning or deep neural networks represent an alternative approach to traditional computer vision techniques. Due to an increase in available storage and computational capabilities, these have won numerous pattern recognition and machine learning contests in recent years [3]. Thanks to their generalization capacity, which enables models to perform new tasks based on previously learned information, deep networks have been applied to diverse perception applications, such as road segmentation in autonomous driving [4], obstacle detection [2, 5], and indoor mobile robot navigation [6, 7].
In [7], a perception system for pedestrian detection and evasion in corridors was developed; it employs a hybrid detector that mixes traditional techniques with Convolutional Neural Networks (CNNs) and achieves real-time performance of 10 frames per second (fps). In [4], a Faster Region-based CNN (R-CNN) model was implemented on a GPU with post-filtering to detect and consider objects that are common in road-navigation tasks; this model can run at 10 fps using an Nvidia GeForce 980.
Authors in [8] developed a CNN model based on AlexNet [9] by applying several segmentation algorithms combined with the Pascal dataset [10]. In [11], a perception system to detect and plan trajectories along fire areas was developed; by employing a YOLO [12] based neural network trained to locate and classify fire areas (as safe or as hazardous), while a modified version of A
Human-robot interaction has been another topic of interest for many researchers in the field of robotics; understanding how human gestures are related to speech content can be applied to improving human-machine communication. In [17], an architecture that relates phrases and gestures through body expressions was proposed in order to evaluate different language models using semantics. In [18], a methodology for co-verbal gesture recognition employing text was developed, as was the automatic improvement of execution time in collaboration with humans [19]. Another topic of interest for achieving better human-machine interaction is eye-tracking for controlling gaze-based human-machine interfaces through CNNs [20, 21]. Likewise, other efforts have focused on improving the perception of artificial eyes, such as optical flow calculation for multi-aperture sensors [22] and dense depth map estimation from unfocused images [23].
Proposed perception system
Although many works have been developed on the subject of mobile robot perception, most have focused on solving specific problems, e.g., detecting specific objects such as obstacles or targets or generating evasion trajectories with prior knowledge of the environment, which in most cases requires sending information to a computer with higher computational capacity in order to achieve a consistent processing speed. Thus, the development of fast generic embedded systems remains a challenge at present, where the generalization of tasks is also a key point.
This work proposes a novel perception system for mobile robots. The system is composed of three main steps. First, a detection system comprising two neural networks in parallel detects a set of objects that may be of interest during interaction with the environment (a set of potential targets to reach and a set of potential obstacles to avoid), while a third network identifies gestures from human partners in order to define a single goal when several are available. Second, stereo vision and filtering algorithms are used to recover the 3D position of every detection. Finally, all the information is used to generate a collision-free trajectory from the robot’s initial position to the target selected by interpreting the human gestures. All steps were implemented in real time on a single Jetson TX1 developer board equipped with an Intel Movidius compute stick. Compared with other works, the main contributions of this paper are the following:
A full vision-based perception system is proposed, with different stages for detection, classification, 3D position estimation, and collision-free trajectory generation.
The detection system performs all detections (obstacles and targets) on RGB images, so depth is read only at the pixels that contain an object of interest rather than by processing the whole depth map, which increases processing speed for real-time applications.
The system can recognize a set of gestures from human partners commonly used to indicate a desired direction or target. This information is then employed with support algorithms to choose a target to reach when several are available.
Two new labeled datasets for supervised learning have been developed: the first includes indoor images for obstacle detection tasks, and the second includes indoor and outdoor images for human hand-gesture recognition. Each includes thousands of real images taken in different environments.
All stages were designed to be as fast and lightweight as possible for use as an embedded system, capable of running on a Jetson TX1 board with an average processing speed of 15 fps.
Another advantage lies in the capacity of the networks to recognize object classes, which allows the robot’s interaction to be adapted to the object type; in this case, to take human navigation orders into account. In addition, the presented obstacle detection network detects obstacles at floor level directly from RGB images, which provide a richer set of features than other sensor types (such as radar or laser-based systems) because additional information can be extracted from the pixels that correspond to an obstacle.
Perception problem scheme. Guided by pedestrian hand signals, a mobile robot must plan a trajectory to a specific target while avoiding collisions with the objects in its path.
The rest of the paper is organized as follows: Section 2 describes the problem definition covered in this work, while Section 3 describes each stage of the detection system, which is composed of parallel neural networks. Section 4 covers the spatial reconstruction and trajectory generation modules, while Section 5 discusses experimental results and analysis of the whole perception system. Finally, Section 6 presents the conclusions and future work.
This paper considers the following situation: A mobile robot is standing in an environment waiting for a human command to reach a specific target. When the order is given, and the target is detected, the robot must plan its trajectory while avoiding collisions with the other objects in its path (Fig. 1).
Proposed perception system composed of 6 main blocks. 1: Input block, 2: Preprocessing step, 3: Parallel neural network system, 4: Spatial reconstruction, 5: Decision scheme, and 6: Trajectory generation. The size and dimensions of the processed data are shown at the bottom of each step.
Preprocessing step. Input RGB process scheme.
Within the scope of this work, the following restrictions are considered:
A depth map of the robot’s vision is computed at all times using an efficient stereo system [24].
For terrain robots, it is assumed that obstacles are at floor level; objects above this area are taken as if they were on the floor, which is a common situation for indoor environments.
It is considered that the position of obstacles and targets can present slight variations between detection frames; therefore, tracking and filtering algorithms were designed under this limitation.
The above restrictions allow a robot to navigate securely through an indoor setting, where corridors, walls, pedestrians, and the majority of objects usually are touching the floor.
Initially, a single YOLO-based [25] network was proposed to unify all the detection tasks; in practice, however, it had difficulties generalizing hand signal recognition across other specific classes. Moreover, at the first step, an obstacle detection task may not require the class of each detected object, as any class represents a collision risk; the most important information to know is the position and dimension of each obstacle. By considering the previous information, the problem was divided into separate tasks and models; the proposed perception system (summarized in Fig. 2) is composed of 6 main blocks:
Input Block: an RGB image and a depth map are acquired from the environment employing a stereo camera.
Preprocessing Step: the input RGB image is processed in parallel in two ways according to the input specifications of subsequent modules.
Detection System: the processed images are introduced through the detection system, which consists of three neural networks in parallel; the first generates an array with the position of the nearest obstacle for every 20 pixels on the image, while the second generates bounding boxes with the position of potential targets. As a second stage, if a pedestrian is detected, a third neural network is employed to determine if a hand signal (left, stay, or right) is given by each of the pedestrians.
Spatial Reconstruction: using camera calibration parameters and the information from the depth map, the spatial position of all objects is calculated, employing several filtering and tracking algorithms to deal with noise and other detection problems.
Decision Scheme: if multiple hand signals are given by human partners in a multiple target scheme, a simple voting scheme is used by considering each partner’s location to determine a single target.
Trajectory Generation: using the RRT algorithm [26], the information gathered in the previous stages is employed to generate a 2-D evasion trajectory from the camera origin position. This trajectory is preserved until the selected goal moves a
Preprocessing step
Through a ZED stereo camera, a 3
For obstacle detection, the input image is divided into 32 columns of width For object detection according to YOLO [12], the same input image is resized to a standard 1
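As a rough illustration of the column split used for obstacle detection, the sketch below computes the pixel ranges of 32 vertical stripes. The 640-pixel image width is an assumption, chosen only so that each stripe is 20 pixels wide and matches the 20-pixel spacing mentioned for the detection system; `column_ranges` and `crop_stripes` are hypothetical helpers, not part of the original implementation.

```python
def column_ranges(image_width=640, n_columns=32):
    """Return (start, end) pixel ranges, end exclusive, for each vertical stripe.

    image_width=640 is an assumed value; n_columns=32 comes from the text.
    """
    if image_width % n_columns != 0:
        raise ValueError("image width must be divisible by the number of columns")
    stripe_width = image_width // n_columns
    return [(i * stripe_width, (i + 1) * stripe_width) for i in range(n_columns)]


def crop_stripes(image, n_columns=32):
    """Split an image, given as a list of pixel rows, into vertical stripes."""
    width = len(image[0])
    return [[row[a:b] for row in image] for a, b in column_ranges(width, n_columns)]
```

Each stripe returned by `crop_stripes` could then be passed to the obstacle network as one input column.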
Detection system
To define the detection system, it is necessary first to establish the differences between the object types involved while executing a movement through the environment:
StixelNet, a modified neural network structure. F indicates the number of filters, h and w the size of the kernel on convolutional and pooling layers, N the number of neurons in fully-connected layers, and BN the position of batch normalization layers.
StixelNet, output layer behavior. The network is trained to predict the nearest obstacle position for every 20 pixels of the image.
For obstacles, it is of primary importance to ensure that their position does not present a risk of collision during the execution of a movement; regardless of kind or class, any object is a potential obstacle if the robot can collide with it. For other objects that require some level of interaction, the class is important because the interaction may vary with the target type in terms of approaching distance, the robot’s final pose, or other practical and safety considerations. For hand signal recognition, a prior search for pedestrians may enable fast detection by avoiding the processing of irrelevant pixels: the search is focused on the area around previously detected pedestrians, where the configuration of the arms can be directly detected. Thus, this task can be added as a second processing step once a pedestrian is detected in the environment.
Given these considerations, the general detection task was divided into three sub-tasks: obstacle, object, and hand signal recognition. In this work, a detection system composed of three parallel neural networks was employed to deal with each detection task. Each network is discussed in detail in the following subsections, including its structure, processing scheme, training, and implementation.
Neural network structure
The obstacle detection problem may be described as follows: given an input image
StixelNet on some validation images, obstacle detection (left) and probability matrix (right). A darker color in the matrix represents a higher probability 
Tiny Yolo performance of some validation images.
A smaller input image was considered, consisting only of the region bounded by the horizontal pixels The size and stride of all filters were modified to adjust the network to the new work resolution (100 In order to improve the training process, batch normalization [15] and dropout layers [14] were added to prevent overfitting during training.
This network receives a single 3
For this work, an obstacle detection zone with limits between
In order to build the dataset for the network, a ZED stereo camera was used to take pairs of images and depth maps. During the data acquisition process, two different scenes were considered: First, images were taken while passing through hallways and interiors, where people, walls, and different types of obstacles were present. Second, in a fixed place, the position of different types of obstacles was varied in the range
After the process, 100 images distributed into 3,200 stixels were collected. By taking advantage of both the obstacle detection zone
By carrying out the data augmentation process, 115,500 stixels were generated; 104,000 were used for the training set and 11,500 for the validation step. The neural network was designed and trained using Caffe [27], through 20,000 iterations with a batch size of 64 stixels and a learning rate of 5.25
Target detection
You Only Look Once (YOLO) [12] is a state-of-the-art convolutional neural network that employs bounding boxes to achieve object detection and classification; its most recent version (YOLOV3) was released in 2018. Although there are multiple sets of pre-trained models, such as Tiny-Yolo for 80 classes, they require a considerable amount of memory to run at an acceptable speed. In practice, due to hardware limitations with the other system modules, it was necessary to retrain a smaller model with a reduced set of classes. In this work, the set of classes c
Neural network structure
In this stage, a short version of Tiny YOLOV3 [12] was employed. This architecture has 13 convolutional layers, with a Max-pool layer placed at the output of each convolution terminating in a set of standard YOLO detection layers which generates a
SignalNet, network architecture. F indicates the number of filters, h, and w the size of the kernel on convolutional and pooling layers, N the number of neurons in fully-connected layers, and BN the position of batch normalization layers.
A set of images taken from the MS-COCO dataset [29], restricted to the set of 9 classes, was used to train and test the reduced model: a total of 118,287 images were taken for training and 19,358 for testing. Using the darknet framework [12], the neural network was trained through 500,200 iterations with a batch size of 64 and a learning rate of 2
Hand signal detection
Neural network structure
The hand signal detection problem may be described as follows: if a person is detected, the neural network must determine whether any of the available hand signals (go left, stay, or go right) is indicated. Taking speed into account, this neural network was designed to be as small and fast as possible without a significant loss of generalization capacity. After attempting multiple classification-based configurations, the structure presented in Fig. 8 was used. It consists of two convolutional layers followed by a fully-connected layer with three neurons at the output, and it classifies the input bounding box areas from previous steps into one of the hand-signal gestures.
SignalNet, output layer behavior.
SignalNet performance on some validation images. The probability output vector is presented below each image (for right, stop and left hand-signal classes).
This network receives a resized 3
Dataset building process
The dataset for this network was built using 2,700 480
The neural network was designed and trained using Caffe [27], through 10,000 iterations with a batch size of 128 images and a learning rate of 2.5
Post-processing step
Spatial reconstruction
Each neural network output
where
Once relevant objects are detected by the previous steps, a filtering and tracking scheme is proposed to deal with the noise related to position estimation (due to incorrect depth calculation by the hardware) as well as multiple detections of a single object (due to movements across the environment or to multiple bounding boxes on the same object). In this scheme, for each pair of detection arrays
During the first detection, a constant velocity Kalman filter [31] with states When a new frame arrives, the position of all Euclidean distance, as well as the Intersection Over Union (IOU) measurements, are employed with the Hungarian algorithm to establish correspondences between objects stored in the global dictionary and the incoming frame. For an object with correspondence in a previous frame, its Kalman filter position If a previously detected object does not receive a correspondence during
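The frame-to-frame association step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cost mixes Euclidean distance and (1 - IoU) with hypothetical weights, the gating threshold `max_cost` is invented for the example, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same minimum-cost assignment but only scales to a handful of objects).

```python
import math
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cost(track, det, w_dist=1.0, w_iou=1.0):
    """Assumed cost: weighted Euclidean distance plus (1 - IoU)."""
    d = math.dist(track["pos"], det["pos"])
    return w_dist * d + w_iou * (1.0 - iou(track["box"], det["box"]))


def associate(tracks, dets, max_cost=2.0):
    """Return (track_index, detection_index) pairs of the cheapest assignment.

    Brute force over permutations (stand-in for the Hungarian algorithm);
    pairs whose cost exceeds max_cost are discarded as non-matches.
    """
    if not tracks or not dets:
        return []
    n, m = len(tracks), len(dets)
    c = [[cost(t, d) for d in dets] for t in tracks]
    if n <= m:
        candidates = (list(zip(range(n), p)) for p in permutations(range(m), n))
    else:
        candidates = (list(zip(p, range(m))) for p in permutations(range(n), m))
    best = min(candidates, key=lambda pairs: sum(c[i][j] for i, j in pairs))
    return [(i, j) for i, j in best if c[i][j] <= max_cost]
```

Tracks left without a pair would keep their predicted state, and unmatched detections would start new tracks; both policies are outside this sketch.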
In this work, by considering a general processing speed of 15 fps, a rate of
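A constant-velocity Kalman filter of the kind mentioned above can be sketched per axis as follows. The time step of 1/15 s follows from the 15 fps processing speed; the process and measurement noise values `q` and `r`, the identity initial covariance, and the class name are assumptions made for illustration only.

```python
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter with state [position, velocity]."""

    def __init__(self, x0, dt=1.0 / 15.0, q=0.01, r=0.05):
        self.x = [x0, 0.0]                  # state estimate
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance (assumed start)
        self.dt, self.q, self.r = dt, q, r

    def predict(self):
        """Propagate the state one frame ahead with F = [[1, dt], [0, 1]]."""
        dt, P = self.dt, self.P
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        # P = F P F^T + Q, with the simplified choice Q = q * I.
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]
        return self.x[0]

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        P = self.P
        s = P[0][0] + self.r                # innovation covariance
        k0, k1 = P[0][0] / s, P[1][0] / s   # Kalman gain
        y = z - self.x[0]                   # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P = (I - K H) P
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        return self.x[0]
```

One filter instance per coordinate (for example lateral position and depth) gives a 2-D track; calling `predict()` alone on frames without a matched detection lets the track coast.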
Decision scheme
Given a set
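The voting idea behind the decision scheme can be illustrated with a short sketch. The exact rule is not recoverable from the text here, so everything below is an assumption: each pedestrian is reduced to a lateral position and a signal label, "left" is taken to mean smaller x in the camera frame, a "stay" signal casts no vote, and ties are broken by the lowest target index. `vote_target` is a hypothetical helper.

```python
from collections import Counter


def vote_target(pedestrians, targets):
    """Pick one target index from hand-signal votes, or None if nobody votes.

    pedestrians: list of (x, signal), signal in {"left", "stay", "right"}.
    targets: list of lateral positions x (assumed: left means smaller x).
    """
    votes = Counter()
    for x, signal in pedestrians:
        if signal == "left":
            side = [i for i, tx in enumerate(targets) if tx < x]
        elif signal == "right":
            side = [i for i, tx in enumerate(targets) if tx > x]
        else:  # "stay" casts no vote in this sketch
            continue
        if side:
            # Vote for the closest target on the indicated side.
            votes[min(side, key=lambda i: abs(targets[i] - x))] += 1
    if not votes:
        return None
    top = max(votes.values())
    return min(i for i, n in votes.items() if n == top)  # ties: lowest index
```

Using each partner's own position as the reference point is what lets two people standing on opposite sides of a target point at it with opposite gestures and still agree.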
Trajectory generation
Once a single target is defined, it is necessary to define a trajectory
A new position of the target A new detected position
By taking the previous considerations, a collision-free trajectory is continuously recalculated once the first target is detected by adapting it to the changes in the environment.
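A basic 2-D RRT planner of the kind referenced in [26] can be sketched as follows. This is a generic textbook variant under stated assumptions rather than the authors' implementation: obstacles are modelled as circles `(x, y, radius)`, the safety zone is a fixed `margin` added to every radius, and the step size, goal tolerance, goal bias, and iteration limit are illustrative values.

```python
import math
import random


def collides(p, obstacles, margin):
    """True if point p lies inside any circular obstacle inflated by margin."""
    return any(math.dist(p, (ox, oy)) <= r + margin for ox, oy, r in obstacles)


def segment_free(a, b, obstacles, margin, step=0.05):
    """Check a straight segment by sampling points roughly every `step` units."""
    n = max(1, int(math.dist(a, b) / step))
    return not any(
        collides((a[0] + (b[0] - a[0]) * t / n, a[1] + (b[1] - a[1]) * t / n),
                 obstacles, margin)
        for t in range(n + 1)
    )


def rrt(start, goal, obstacles, bounds, margin=0.3, step=0.3,
        goal_tol=0.3, goal_bias=0.1, max_iter=5000, seed=0):
    """Basic 2-D RRT; returns a list of waypoints or None if none is found."""
    rng = random.Random(seed)
    (xmin, xmax), (ymin, ymax) = bounds
    nodes, parent = [start], {0: None}
    for _ in range(max_iter):
        # Sample the goal with probability goal_bias, else a random point.
        sample = goal if rng.random() < goal_bias else (
            rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        # Steer from the nearest node toward the sample by at most `step`.
        scale = min(step, d) / d
        new = (near[0] + (sample[0] - near[0]) * scale,
               near[1] + (sample[1] - near[1]) * scale)
        if not segment_free(near, new, obstacles, margin):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) <= goal_tol:
            # Walk back through the parents to recover the path.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```

Replanning then amounts to calling `rrt` again with the updated obstacle and target positions whenever a replanning condition is triggered.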
Once a set of points for a collision-free trajectory is calculated, it is necessary to design a controller that enables a mobile robot to follow the desired path. Several techniques have been applied for that purpose, such as PID (Proportional-Integral-Derivative), feedback linearization, sliding mode, and fuzzy logic controllers.
In [32], an uncoupled linear PID controller for a differential wheeled robot is proposed; with system stability proven through Lyapunov theory, the controller is able to track different trajectories. The authors of [33] proposed a robust tracking control for a differential-drive wheeled robot with nonholonomic constraints by employing the feedback linearization technique. In [34], a sliding mode controller derived from a Lyapunov function and feedback linearization is presented, which is robust against matched perturbations. Other works that employ a combination of several control laws can be consulted in [35, 36, 37].
On the other hand, in some cases the trajectory points are not smooth enough for a given mobile robot, and a trajectory planning technique is required; see [38] for works related to this issue.
Experimental results
In this section, several performance tests are carried out to evaluate each module of the perception system:
Detection accuracy of the three parallel neural networks of the detection system.
3D position estimation by employing the detection modules with the depth map.
A full system evaluation through different real scenarios.
Obstacle detection
StixelNet was evaluated through a digital validation dataset with the aim of measuring its performance across the whole detection zone
Only those detections that meet the condition For each detection, the average median quadratic error
The validation set was composed of 5,000 images with 10 different obstacles in 10 scenarios across the 50 regions of the detection zone
StixelNet, detection zone evaluation
M.E: Mean Error, A.P: Average Performance.
This may be an effect of the data augmentation process, which leaves this area with fewer examples than the others (which are more likely to benefit from the augmentation process). Nevertheless, since this is the farthest zone, it may not be a problem when executing a movement, as the potential obstacles in that zone will register in closer zones as the robot approaches the target while moving through the calculated trajectory.
Position estimation for several obstacles samples employing the modules of the perception system. Top: Horizontal distance estimation, Bottom: Depth estimation.
To evaluate the performance of the network, the validation dataset from MS-COCO (composed of 3,360 images) was used under the mean Average Precision (mAP) metric. For every image, the network was used to estimate the position and class of all objects; this measurement was then compared to the labels to calculate a performance estimation under the following criteria:
where
YOLO evaluation on dataset
mAP: Mean Average Precision.
The generated validation dataset (composed of 2,500 samples) was employed to evaluate the performance of the network. Following the same scheme as in the previous evaluations, a detection is considered correct only if the predicted class matches the true label. The network achieved an average performance of 90%, 98%, and 90% for the left, stop, and right classes, respectively (Table 3).
SignalNet evaluation results through validation dataset
A.P: Average Performance.
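The per-class criterion used for the hand-signal evaluation (a prediction counts as correct only when it matches the true label) can be sketched as below. The function name and the toy labels in the usage note are hypothetical and are not taken from the validation dataset.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified samples for each true class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}
```

For instance, with the made-up labels `["left", "left", "stop"]` and predictions `["left", "stop", "stop"]`, the function reports 0.5 for the left class and 1.0 for the stop class.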
The obstacle detection and 3D reconstruction modules were employed to estimate the position of 132 manually measured objects. In this scheme, the median square error
To avoid false positives, only those detections inside a threshold The camera position To obtain the real 3D position of each object, manually measured distances were taken from the workspace origin (camera). For objects with a stixel in more than one stripe, the average position was employed.
After the evaluations, a median quadratic error of
Real-time implementation
The whole perception system was implemented on an Nvidia Jetson-TX1 developer board [39] equipped with an Intel Movidius compute stick [40], through individual nodes using the ROS framework in C
Perception system, full implementation diagram. The average processing speed of each stage is presented.
Several experiments were carried out to evaluate the whole perception system, varying the position of different types of obstacles, people, and objects according to the following considerations:
First, the system was used to evaluate different scenarios with static pedestrians and objects at measured positions, in order to evaluate the different models working together against a measured reference for each estimation. Second, the system was used to track the position of pedestrians moving across the workspace while the other modules generate a collision-free trajectory according to the conditions mentioned in Section 4.4. As mentioned above, the camera frame was taken as the origin, so all measurements are referred to this point. The task considered is to calculate a trajectory from the initial robot position (camera frame) to the desired target while objects in the workspace change their positions.
Static objects
Given the above considerations, the perception system was used to generate a trajectory from different representative obstruction problems, according to the following scheme:
Fixed scenarios with pedestrians, objects, and different hand signal configurations indicating a single target to follow were considered. Each scenario was evaluated using the whole perception system, generating an estimated 2D position for each detected object, including a hand signal label for pedestrians, which was then employed to choose a single target to reach. A collision-free trajectory was generated for each case, considering a safety zone of
These results are presented in Fig. 13, which shows that the system can perform the different perception tasks by detecting all obstacles on the floor, all people in the environment, and all hand signals given by each person. The collision-free trajectory generated in each experiment also allows the robot to reach the desired target while avoiding collisions with all objects along the path. Figure 13 also shows that the farther an object was during the experiments, the more difficult it was to determine its exact 3D position (a slightly different position was estimated). Two limitations follow from the detection method, which divides the image into 32 columns: each stixel tends to cover a greater horizontal distance as the depth increases (from the bottom to the top of the image), and obstacles fully occluded by other objects (in vertical position) are not considered during the path planning process (case 2 in Fig. 13). Nevertheless, both cases can be solved using the replanning scheme: as new frames are taken while the mobile robot moves along the path, new views of the obstruction problem become available, enabling the detection of obstacles that were occluded in previous frames, as well as of objects that now register in closer zones after previously registering in farther ones.
Experimental results of real-time implementation by considering static objects. OpenCV interface (top), ROS RViz interface (middle), and representative problem (bottom). The above detections (obstacles, targets, and hand signals) are represented with different marker types in each interface. As can be seen, the perception system is able to recognize all objects, as well as the hand signals given by the human partners. Also, the generated trajectory allows reaching the target without colliding with the real obstacles in each scenario (by the considered security zone Q).
Experimental results of real-time implementation by considering moving objects. Tracking: The system is intended to track a single defined target while moving across the environment. Signal detection: The system is intended to recognize and track a target defined by the hand-signals. In both cases, a collision-free trajectory is calculated each time that any of the conditions mentioned in Section 4.4 are met.
As a second evaluation step, a new set of scenarios with moving pedestrians was considered to evaluate the tracking and replanning algorithms, according to the following scheme:
First, a scenario with three moving pedestrians was considered. The system is intended to track each one while defining a collision-free trajectory to a single defined one. Second, two moving pedestrians indicating hand-signals were considered. In this scheme, the system is intended to recognize, track, and calculate a collision-free trajectory to the goal defined by the hand-signals. Once the first trajectory is generated, it is preserved until the conditions presented in Section 4.4 are met, or a new target is defined by the hand-signals given. A target security zone of If a target zone with center at the goal position cannot be defined due to collisions with other obstacles, a new center is calculated by selecting a collision-free random point near the target (inside a distance
These results are presented in Fig. 13. In the first case, the system is able to track the selected pedestrian and generate a collision-free trajectory to reach it every time one of the previously mentioned conditions is met. In the second experiment, the hand-signal detection network is able to recognize the gestures given by the pedestrians, which allows the target to be changed whenever required, while the position of each pedestrian is correctly used to set the target to reach. As can be seen, the different modules are able to work together, and the tracking and filtering algorithms follow the position changes of the objects in the workspace.
For additional multimedia resources about these experiments or access to the datasets employed to train the networks, visit: drive.google.com/drive/folders/1X-BePp-GPb4Nhf48Co_NBOMTEwGMuwQR.
Conclusions
In this paper, a perception system for collision avoidance in mobile robots was proposed. Using a set of three neural networks in parallel, this system is capable of detecting objects on the floor that could be obstacles while executing a movement, as well as a set of objects that could be targets. This obstacle detection scheme is similar to the human ability to perceive depth without the need to establish correspondences between two points of view, which allows obstacles to be detected directly from monocular RGB images. The system is also able to recognize human hand signals, which in turn may help to define a single target to reach when several are available. By taking advantage of modern hardware devices, the system can generate a 2D evasion trajectory from the robot’s position to a specific desired target. To measure its performance, the system was evaluated at each stage through several performance tests according to each detection task, and proved to be robust and fast enough for real-time implementations.
As future work, we will consider different approaches for dynamic obstacles and targets based on the object type, the recognition of a wider variety of human gestures, and new filtering and replanning schemes for new obstruction problems and for situations with the robot in motion.
Acknowledgments
The authors would like to thank CONACYT project A1-S-10412 Percepción, Aprendizaje y Control de Robot Humanoide and CINVESTAV-IPN for the economic and technological support for the realization of this work.
