Abstract
Path planning for human beings is based on scene understanding and semantic map as well as geometry features. The point cloud map can help robots to perform path planning. However, it is quite different from the natural way of human beings. In this paper, we propose to construct a novel framework for topological semantic map. Specifically, we construct a 2D semantic map by projecting 3D scene semantic information recognized by convolutional neural network onto a 2D plane. 3D reconstruction of the environment is achieved by RGB-D SLAM 3D space mapping algorithm. The intersections in the 2D map are recognized offline, and the semantic annotation of the intersection in the topological map is utilized to build up a complete object-based semantic map. Experimental results demonstrate the feasibility of the proposed approach.
Introduction
Conventional SLAM (Simultaneous Localization and Mapping) algorithms [1, 2] are able to construct dense maps and record robot trajectories when dealing with relatively small-scale scenes. Nonetheless, when handling large-scale scenes where local environments might change with time, the robustness and computational complexities cannot be satisfactorily guaranteed. Semantic and topological information are the counterparts depended by humans when executing SLAM task in a large environment. They are both effective cues that might enhance the robustness and reduce the computational complexities for a more intelligent and versatile SLAM algorithm [3, 4].
In order to fulfill this philosophy, [5] proposes a system that can assign each meaningful region in the scene a specific label such as to facilitate the robot to recognize and locate the desired place. However, the topological information fails to be involved into their algorithm which might potentially make the robot ineffective to plan the path toward a location far from its present position. In comparison, [6] choose to utilize LSD-SLAM (Large-Scale Direct Monocular SLAM) and deep CNN to provide the objects in environment with semantic labels. When extracting the semantic information on the fly, the robot is able to effectively determine the scene class and is competitively robust when performing closure detection tasks. However, topological and semantic information have not been integrated into mapping process simultaneously. Accordingly, the mapping results might not incorporate sufficient meaningful abstracted information for the robot to handle semantic SLAM. Therefore, we attempt to propose a framework that takes both of them into consideration. Specifically, the RGB-D SLAM algorithm is applied to construct a 3D realistic map. The 3D map can be projected to 2D plane to acquire the 2D grid map at the same time when constructing the semantic map. We utilize the VGG-Net16 to realize the capability of scene understanding and assign the classification result to the corresponding regions of the grid map. Thus, we are able to undertake the semantic segmentation operations for real scenes. Additionally, we adopt Bayesian filter to bring into the spatial-and-temporal-consistency priori to enhance the generalization ability of semantic segmentation. In order to improve system robustness, we add topological information in final mapping result. In this paper, the road intersections are considered as topological nodes. By defining to determine “what a stable topological node is”, the robot can automatically generate a topological map with intersections as its topological nodes. Finally, the proposed framework can successfully build a map fusing both topological and semantic information.
Related work
Vision-based SLAM can be widely applied in many application fields such as driverless car and AR. It is a popular direction of research in recent years [7]. Typical visual SLAM algorithm always targets on estimating the pose of the camera, and reconstructing the 3D map by using multiple view geometry theory. To efficient data processing, some visual SLAM algorithm extract sparse image features, then inter frame estimation and loop closure detection by means of matching between the feature points. Such as visual-base SLAM based on SIFT (scale- invariant feature transform) features and visual-based SLAM based on ORB-SLAM (oriented FAST and rotated BRIEF SLAM)features. Depending on their robustness, excellent classification and fast processing feature, SIFT and ORB are widely-used in visual-base SLAM fields. However, artificially designed sparse image features are far too confined by itself. On the on hand, how to design sparse image features to optimally show image information is still an unsolved significant problem in computer vision field. There are many challenges for sparse image features on its reflection on illumination variation, motion of targets and change of video-camera parameters and lack of texture environment.
To solve these problems, in the field of visual SLAM, a hierarchical image feature extraction method represented by deep learning technology, has appeared in recent years and has been successfully applied to SLAM inter frame estimation [8, 9] and loop closure detection [10, 11]. It relies on the hierarchical feature representation of multi-layer neural network learning images, and can achieve higher recognition accuracy compared with traditional recognition methods [12, 13]. At the same time, deep learning can also associate images with semantics, generate semantic maps of environments with SLAM technology [14, 15], construct semantic knowledge, bases of environments [16] for robots to perform cognition and task reasoning to improve robot service intelligence and human-computer interaction. In a recent work, Trevor et al. [17] introduced a single scene point cloud segmentation utilizing connected components practices through RGB-D data. First a planar segmentation step is performed on the point cloud data to distinguish the dominant planes in the scene. Then an L2 norm based clustering and a connected component labeling mask are applied on the color image, in order to detect objects on a tabletop [18]. Presents spatial 3D eigenvectors for a single scene classification, which operates on pre-captured frames. In a separate interpretation of the scenario, Buschka et al. [19] gradually divided the grid of indoor environments into two types of open spaces (rooms and corridors). Mozos et al. [7] applied AdaBoost classifier to classify places in into rooms, corridors and doorways. In addition, the author in [20] utilized visual input to infer about the category label of the detected objects during the robot’s perambulation. This piece of information was treated in a hierarchical fusion manner to further to characterize the observed scenes in accordance with the existing objects. In [21], deep convolutional network (CNN) complemented with a series of one-vs-all classifiers is used for visual semantic mapping. In [22], CNNs are used to classify grid maps built from laser data into 3 classes: room, corridor, and doorway.
This paper aims to add semantic information and fork node topology information to the semantic SLAM. We proposed to create a map combining 2D maps, 3D maps, and fork topology information.

System overview.
Construction of two-dimensional semantic map
An improved VGG-net16 [23] is applied to generate the probabilistic distributions of 205 known object classes. We utilize the Bayesian filter to guarantee the time consistency property of the acquired distributions and incorporate the observations as priori information. The object class estimated through the filtered probabilistic distribution is then used to label the corresponding region of the grid map built by Kinect’s depth information.
Scene Classification based on CNN
The module to understand the scene captured by Kinect and to provide classification labels is based on the VGG-net16 proposed by [23]. VGG-net16 is trained on the Places-205 dataset which comprises 250 semantic classes consisting of 2.5 million pictures, 5 thousand pictures for each class on average. These samples are obtained through platforms like Google, Bing, and Flickr and labeled manually. The abundant data ensures the generalization capacity of the VGG-net16 such that it can perform well in various application cases.
VGG-Nets is originally advocated for object classification task [24]. They have very deep convolutional architectures with smaller sizes of convolutional kernel (3×3), stride (1×1), and pooling window (2×2). There are four different network structures, ranging from 11 layers to 19 layers. An intuitive impression is that the performance of the model is proportional to the depth of the network. Thus, [23] first adopts an 11-layer network as initialization and subsequently extend the model to the deeper network.
For an arbitrary RGB image captured by Kinect, we choose a 224×224 region individually from the four corners and the center. A horizontal flipping operation is imposed on the five selected regions. Through inputting the ten derived regions into the ConVnet, the final classification result is acquired as the average over the ten output values.
Semantic mapping in a probabilistic manner
It is easy for us to integrate other information when combining the priori knowledge to transfer the scene classification problem into a counterpart of Bayesian estimation. For instance, as is known, the 205 object classes included by Places 205 dataset is not possible to be observed in most environments. In this paper, we consider the scene recognition problem as a probabilistic estimation one, where the known scene is defined as
As for the label of overall scenes, we apply the VGG-net16 above to estimate the discrete distribution
Where is the discrete distribution related to the scene of the n-th class provided with the t-th image. c0 ∼ c n is mutually independent of each other. The adjacent two frames captured by the Kinect have the consecutive timestamp, which allows us to utilize the Bayesian estimation method.
Assuming above mentioned problem satisfies the first-order Markov property, a more consistent scene recognition result can be further acquired. Subsequently, the following equation can be given:
Furthermore, the scene recognition problem can be regarded as a Bayesian estimation problem if we combine the priori knowledge. Thus, other available priori information can be incorporated into the computation. For example, if we know in advance that most scenes among the 205 scenes appear with an extremely low probability, the priori probability can be computed as (4)
Finally, all the unoccupied regions observed by the Kinect are updated and labeled with the estimated scene class.
We depend on the RGB-D SLAM platform advocated in [25] for real-time 3D reconstruction, which is a graphics-based SLAM system consists of the front-end and back-end.
The front-end is responsible for extracting the spatial relations between distinct observations. Specifically, it can undertake the feature detection and the descriptor extraction on each RGB frame. Through finding the feature correspondence between two frames and the corresponding depth information, we are allowed to obtain the correspondence between the voxels derived from depth information. Accordingly, the 6D motion can be computed and optimized. Specifically, in each iteration, we compute the RGB values and depth related to the adjacent two frames. We further perform feature detection and descriptor extraction as well as the feature matching such as to obtain the set of matched 2D feature pair. The feature pair is two pixels respective from the two frames. We can use such feature pair to acquire the corresponding projection coordinates and depth information. The coordinates of the corresponding voxel can be computed. Then, we can use the RANSAC method to estimate the camera motion between the two frames. With this motion parameters as initialization, we can further apply the ICP algorithm to optimize the motion parameters and finally get the optimal motion sequence.
The back-end uses a nonlinear error function to optimize the camera position. Specifically, it uses the graph optimization to initialize the position depending on the motion parameters acquired from the front-end. Additionally, it considers the closure detection as the constraints and finds the optimal position result with the nonlinear error function. The camera motion trajectory and the 3d reconstruction can be obtained as well.
In our proposed framework, RGB-D SLAM takes two crucial roles: 1) construct 3D map suitable for robot navigation; 2) Project the RGB-D map to 2D planes and acquire the grid map.

RGBD-SLAM diagram.
The main gist of mapping is to enable the robot to achieve localization and navigation task. Humans can finish path planning without knowing his accurate position, realizing the object information in the environment, but relying on merely the memory of the scene and the matching capacity. This is the humans’ talent to extract the topological and semantic map of the scene and navigate with it.
The main aim of mapping by the robot is to achieve localization and navigation task successfully. Humans can finish path planning in case of unknowing his accurate position and the object information in the environment, but relying only on the memory of the scene and the matching capacity. Localization by this way is very similar with that by topological semantic map. By the same token, the robot can also determine the position of the robot by recognizing the surrounding environment and its semantic information of the fork in the road. The significance of this method is to detect the fork nodes in the 2D semantic environment map offline and construct the topological semantic map of the fork nodes.
The definition of fork in topology node
The fork point is the center of the intersection between different channels, such as the corridor corner, doorway, intersection and other areas, are called the node area. We mainly through the 2D map of the location of the fork in the number of channels to determine the type of fork in the road.
Calculation method of path width in node area
When the robot detects a branch node, it is necessary to calculate the position coordinates of the branch node and the minimum width of each path in the node area.
In Fig. 3 when the robot detects a branch node, it is necessary to calculate the position coordinates of the branch node and the minimum width of each path in the node area.

A schematic diagram for detecting the topology of a fork road.
In Fig. 4 (b), the no-touch sector path in the robot coordinate system is “I”, and it is assumed that the “a” and “b” in the left and right block sectors of the “I” sector are expressed as (5) with a point column.

Calculation of the width of each path in the node area.
Through (6) we get the matching point pair with the shortest distance between the point column O(i-1) and the point O(i+1) and thus the vertical line segment of the two sets is obtained. The distance from this vertical line is the narrowest width of the path. Figure 4 (a) gives an iterative transformation process in which the direction of the arrow is iterative. The final channel information can be expressed by Kr matching, and the size of the Kr is related to the number of local minimum values in the node area.
In Fig. 3, point rows O(i-1) and O(i+1) at a1 and a2 of the doorway position can be fitted with two parallel straight lines. Then the points on both columns are local minimum points. Similarly, the points on the a2 and a3 position in the corridor are also the local minimum points.
But the actual calculation process is that the point columns O(i-1) and O(i+1) obtained through the depth camera may be discontinuous at the boundary point at the blocking sector. As shown in Fig. 4 (b), the shortest matching point pairs and of blocking sectors a and b are not in the point column O(i-1) and O(i+1) but are obtained by fitting the dotted columns of obstacle sectors a and b. Therefore, by calculating the points fitted by the point columns O(i-1), O(i+1) the accuracy and robustness of calculating matching point pairs are improved.
The type of intersection-oriented topological nodes
The robot can detect a road only when it scans a region where there is no occlusion (the green parts in Fig. 4 (a)). The problem resolved in the section is to determine the type of the topological nodes corresponding to each intersection where two or more roads join. First of all, we have to eliminate several spurious intersections depending on the angle threshold. Specifically, we compute the included angles of two detected roads and stipulate that the included angle θ must be smaller than a threshold θth2 i.e., θ≤ θth2 This refers to the case where the robot is in a straight road and can scan two regions without occlusion (the regions indicated by the blue parts as shown in Fig. 5). Otherwise, the alteration θ′ of the included angle θ is larger than a threshold

Eliminate several spurious intersections.
If the intersection is not the spurious counterpart, we can further determine the corresponding type of the topological nodes. This mainly depends on the number of ways in the fork node. After constructing the 2D map, we can extract keyframes along the trajectory of the robot. For each of such points, we can scan its surroundings and find whether there are roads. If two or more roads are detected, this point is considered as an intersection point. Then we utilize the region around this intersection point to compute a topological node. However, if the width of a road is smaller than 0.5 m or the length of the road is smaller than 1 m, we remove such a region which might be misunderstood as a candidate for computing topological node.
The centroid of the common region
As mentioned above, the topological node is computed by the region around the intersection point. But there exists the situation where two or more intersection points are extremely close to each other (as shown in Fig. 4 (b)). Accordingly, we have to integrate these intersection points into a single one. The method to compute the centroid of the common region around these intersection points is required.
The main idea is first individually to fit a straight line that can represent each road. When having a lot of straight lines, we can further compute their intersections. The mean value of the coordinates related to these intersections is used as the final coordinate of the topological node.
As for the way to fit each road, we propose to depend on the discrete points along both sides of a road. For instance, we define the set of discrete points along one side of the road as O1 and the set corresponding to the other side as O2 We select two points o1 ∈ O1 and o2 ∈ O2 satisfying that: ∀o i ∈ O1, ∀ o j ∈ O2 s.t. dis (o i , o j ) ≥ dis (o1, o2) where dis(.) denotes the L2-norm of two coordinates. When finding these two points, we can construct a line segment whose ends are these two points. The perpendicular bisector of such a line segment is the straight line required to fit the road. For each road, we can find a representative straight line in the same way.
Experiment platform
In our experiment, Turtlebot is used as the mobile platform and Kinect is applied to capture the RGB image and to acquire the depth information. We utilize a 2.3 GHz Intel Core i7 laptop as the processor. The whole framework in our paper is developed based on ROS (Fig. 6).

Experiment platform.
As shown in the legend of Fig. 9, we use six colors to represent six different semantic environments respectively. Figure 9 shows the real-time mapping results of the 3D semantic map for six different environments in three large scenes using this algorithm. This 3D semantic map contains 3D point cloud maps, 2D navigation maps and scenes semantic maps.
The three groups of maps respectively show the result of semantic mapping of the office floor, the lobby and the cafeteria, and the experimental area floor.
During the whole process, we first clip the 224×224 regions respectively from the center and the four corners and execute horizontally flipping operations. We further use these acquired ten views as input to go through the VGG-Net16 using CAFFE. The output of the VGG-Net16 determines the scene class of the present image. Meanwhile, RGB-D SLAM algorithm can generate the grid map. Finally, the estimated class is utilized to label the region of the grid map corresponded to the present image.
Experiment 2
We construct the intersection-oriented topological map in an off-line manner (Fig. 7). The location of the blue point in the figure is the calculated position of the fork node.

Schematic diagram of the calculation results of fork road topology.
After saving a set of data collected by Kinect, we process the result of semantic map recognition by Bayesian filtering (as shown in Fig. 8 (left)) and maximum likelihood estimation (as shown in Fig. 8 (right)) respectively. The experiments have demonstrated that the Bayesian method can effectively guarantee the accuracy of semantic estimation while the Maximum Likelihood method which fails to consider the priori might lead to error estimation.

Comparison of the effectiveness individually resulted by the Bayesian method and the Maximum Likelihood Estimation method.

Scene semantic map with 3D environment information.
The semantic label of the scene and 3D space map are constructed by using convolution neural network and RGBD-SLAM algorithm respectively. The 2D navigation map is set up by projecting the 3d dense map onto the 2D plane with semantic information. We finally construct a topological semantic map with 2D scene recognition and 3D spatial information by extracting off-line topology nodes from the intersections on the 2D semantic map. In the next work, we will annotate the static objects in 3D maps, and integrate their semantic information as landmarks into the closed loop detection framework of SLAM, to improve the stability of the system in complex scenarios.
Footnotes
Acknowledgement
This work was supported by National Natural Science Foundation of China (61772508, U1713213), Guangdong Technology Project (2016B010108010, 2016B010125003, 2017B010110007), Shenzhen Technology Project (JCYJ20170413152535587, JSGG20160331185256983, JSGG20160229115709109), CAS Key Technology Talent Program, Shenzhen Engineering Laboratory for 3D Content Generating Technologies (NO. [2017]476), Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (2014DP173025).
