ESD-SLAM: An efficient semantic visual SLAM towards dynamic environments

Abstract

Traditional visual SLAM algorithms run robustly under the assumption of a static environment but often fail in dynamic scenes, because moving objects impair camera pose tracking. This paper therefore presents an efficient semantic dynamic SLAM (ESD-SLAM) suited to dynamic scenarios. Built on the ORB-SLAM2 framework, ESD-SLAM employs the lightweight semantic segmentation network FcHarDNet to extract semantic information and uses a region growing algorithm to refine the semantic segmentation boundary. Dynamic objects are then removed by combining semantic information with multi-view geometry, which further improves localization accuracy. By combining semantic and depth information, a dense point cloud map of the static scene is constructed to serve the planning tasks of mobile robots. We conduct experiments on the public TUM RGB-D dataset and in a real-world environment. The results show that the proposed algorithm improves the performance of ORB-SLAM2 in dynamic scenes and improves real-time performance compared with other dynamic SLAM algorithms of the same type.

Keywords

Visual SLAM, dynamic scenarios, multi-view geometry, lightweight semantic segmentation

1 Introduction

Simultaneous localization and mapping (SLAM) [1] is a technique that uses sensors to estimate the current pose and build a 3-D map of the environment without any prior information about it. With the development of computer vision, deep learning, and hardware computing power, visual SLAM [2] has been widely used in autonomous driving, mobile robots, and unmanned aerial vehicles (UAVs). Several advanced SLAM algorithms have achieved satisfactory results, such as ORB-SLAM2 [3], ORB-SLAM3 [4], VINS-Mono [5], and LSD-SLAM [6].

However, these algorithms are designed for static scenes, whereas dynamic objects are very common in real life. Dynamic objects cause many wrong data associations, break the pose constraints between frames, and ultimately lead to pose estimation errors for the whole system [10, 11]. Standard visual SLAM reduces the influence of dynamic objects with random sample consensus (RANSAC) [7], but this fails in scenes where dynamic objects occupy a large area of the image.

This paper focuses on removing the feature points of dynamic objects in dynamic scenes and on building a static dense point cloud map in a visual SLAM system. The major contributions of this paper are:

This paper presents an efficient semantic dynamic SLAM (ESD-SLAM), which is based on ORB-SLAM2 and adds prior knowledge from semantic segmentation and multi-view geometry. It works satisfactorily in dynamic environments and significantly improves localization accuracy and stability in high dynamic scenes.

The system adopts the lightweight semantic segmentation network FcHarDNet [23] for real-time semantic segmentation. Experimental results show that the system balances computational efficiency and localization accuracy in dynamic environments.

By combining semantic segmentation information and depth information, a masked depth image is constructed to remove the interference of dynamic objects and build dense point clouds of static scenes.

2 Related work

Currently, most visual SLAM systems are based on the assumption of static scenes and have poor robustness in dynamic scenes [8, 9]. Dynamic objects in the scene can cause feature mismatches, which affect the localization accuracy and the success rate of relocalization. There are mainly two kinds of solutions to this problem: purely geometry-based methods [12–16] and semantic-based methods [17–20]. Geometry-based approaches rely on geometric constraints, such as the epipolar line equation and the principle of triangulation, to separate static and dynamic features. They build on the fact that dynamic features violate the standard constraints that multi-view geometry defines for a static scene. However, geometry-based approaches cannot remove all dynamic objects, e.g., people who remain temporarily stationary. Features on such objects are unreliable and need to be removed from tracking and mapping.

Semantic-based methods first detect or segment objects and then remove outliers from tracking. DS-SLAM [17] used the SegNet [21] network to capture semantic information and combined it with motion consistency checking to filter dynamic features. This method improved the localization accuracy compared with ORB-SLAM2, but the fundamental matrix computed for the epipolar constraint was easily affected by outliers, which reduced the system accuracy. DynaSLAM [18], proposed by Bescos et al., was a SLAM algorithm robust to dynamic scenes that employed the semantic segmentation results of Mask R-CNN [22] and multi-view geometry to detect moving objects and recovered the background occluded by dynamic objects. This algorithm improved the accuracy of pose estimation but could not run in real time. Vincent et al. proposed DOTMask [19], a highly modular pipeline that tracked and masked dynamic objects through semantic segmentation to improve both localization and mapping in visual SLAM; however, its localization accuracy was lower than that of DynaSLAM. Yuan and Chen proposed SaD-SLAM [20], which used semantic information obtained by Mask R-CNN together with depth information to find dynamic feature points. At the same time, static feature points on movable targets were detected and used to fine-tune the camera pose estimate, which made the algorithm robust and accurate. However, the system performed semantic segmentation offline and could not meet the requirements of real-time operation.

Inspired by the works above, this paper proposes an efficient and reliable visual SLAM method that filters dynamic feature points from real scenes. Experimental results show that the proposed method significantly improves localization accuracy and real-time performance in dynamic environments.

3 System description

3.1 Framework of ESD-SLAM

The proposed ESD-SLAM system is built on the RGB-D mode of ORB-SLAM2 by adding a semantic segmentation thread to the original tracking, local mapping, and loop closing threads.

The image processing flow of ESD-SLAM is shown in Fig. 1. The RGB image taken by the RGB-D camera is fed into both the tracking thread and the semantic segmentation thread. In the semantic segmentation thread, FcHarDNet segments the RGB image at the pixel level to obtain the semantic label of each pixel. According to the semantic label, all objects are divided into three categories: static objects, dynamic objects, and potentially dynamic objects. Potentially dynamic objects are those that are usually static but can become dynamic under the influence of other objects (such as chairs moved by people). Then, the region growth algorithm is used to refine the boundary of the semantic segmentation results, and the refined results are passed to the tracking thread, where feature points are first extracted from all three categories without distinction. Potentially dynamic feature points are identified from the semantic information. Dynamic feature points are removed by using semantic information and multi-view geometry, and static feature points are retained for pose estimation.

Fig. 1

Architecture of ESD-SLAM.
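The three-way split of semantic labels described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class indices follow the standard 21-class PASCAL VOC ordering, while the assignment of classes to categories (person as dynamic, chair as potentially dynamic) and all function and constant names are assumptions made for the example.

```python
import numpy as np

# Standard PASCAL VOC class indices (0 = background, ..., 20 = tvmonitor).
VOC_CHAIR = 9
VOC_PERSON = 15

# Hypothetical grouping for illustration; the paper does not list the full mapping.
DYNAMIC_CLASSES = [VOC_PERSON]
POTENTIALLY_DYNAMIC_CLASSES = [VOC_CHAIR]

STATIC, POTENTIALLY_DYNAMIC, DYNAMIC = 0, 1, 2


def categorize_labels(label_map: np.ndarray) -> np.ndarray:
    """Map a per-pixel class-index image to the three motion categories."""
    categories = np.full(label_map.shape, STATIC, dtype=np.uint8)
    categories[np.isin(label_map, POTENTIALLY_DYNAMIC_CLASSES)] = POTENTIALLY_DYNAMIC
    categories[np.isin(label_map, DYNAMIC_CLASSES)] = DYNAMIC
    return categories
```

Keeping the grouping in two small lists makes it easy to move a class between categories for a different environment.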

3.2 Lightweight semantic segmentation

To improve the efficiency of dynamic SLAM, the front-end semantic segmentation network needs to balance efficiency and accuracy. Instead of the computation-intensive Mask R-CNN used in DynaSLAM, our ESD-SLAM system adopts FcHarDNet to obtain semantic segmentation results.

FcHarDNet is based on the Densely Connected Network (DenseNet) [24] but uses sparse connections, resulting in a set of layers called harmonic dense blocks (HDBs). Within a block, layers whose index is divisible by a larger power of two carry more weight in the model and are accordingly amplified by increasing their number of channels. This amplification balances the input/output channel ratio and avoids a low MoC [23]. After each HDB, a conv1x1 layer serves as the transition layer. The block supports back-propagation by passing the gradient directly from the output to all previous layers. FcHarDNet speeds up computation by reducing the shortcuts between layers in the DenseNet architecture. To maintain the model’s accuracy, it reweights the layers of DenseNet and extracts more features from the layers that are connected by more shortcuts.
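The sparse connection pattern can be made concrete with a small sketch. It follows the rule stated in the HarDNet paper [23]: layer k takes input from layer k - 2^n whenever 2^n divides k, and the channel count grows with the largest power of two dividing k. The function names, and the parameter names `growth_rate` and `multiplier`, are assumptions for illustration, and the channel rule omits the rounding that a real implementation may apply.

```python
def hdb_links(layer: int) -> list[int]:
    """Indices of the earlier layers that feed `layer` inside one HDB.

    Layer 0 is the block input. Layer k is linked to layer k - 2**n for
    every n >= 0 such that 2**n divides k and k - 2**n >= 0.
    """
    links = []
    power = 1
    while power <= layer:
        if layer % power == 0:
            links.append(layer - power)
        power *= 2
    return links


def hdb_out_channels(layer: int, growth_rate: int, multiplier: float) -> int:
    """Simplified channel rule: growth_rate * multiplier**n, where 2**n is
    the largest power of two that divides `layer`."""
    n = 0
    while layer > 0 and layer % (2 ** (n + 1)) == 0:
        n += 1
    return int(growth_rate * multiplier ** n)
```

For example, `hdb_links(4)` yields `[3, 2, 0]`, whereas a DenseNet layer at the same position would be linked to all of layers 0 to 3.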

We follow the architecture proposed by the original authors, and the model used in this paper is called FC-HarDNet70. As the name suggests, the model contains 70 convolutional layers spread among 10 HDBs.

The network is based on a PyTorch implementation and is trained on the PASCAL VOC 2012 dataset [26]. Objects are divided into 21 categories, including the background.

3.3 Semantic segmentation boundary optimization based on region growth algorithm

To improve segmentation precision at the boundaries, we use the region growth algorithm, which incorporates the prior information from semantic segmentation, to obtain the actual edges of dynamic objects in depth images.

The key issues in the region growth algorithm are the selection of the growth seed and the choice of the growth criterion. First, to determine the growing seed points, we collect the pixels that belong to the same mask into a set:

$m = \{(x_{1}, y_{1})_{mask_{i}}, (x_{2}, y_{2})_{mask_{i}}, \dots, (x_{k}, y_{k})_{mask_{i}}\}, \quad m \in \{m_{1}, m_{2}, \dots, m_{n}\},$

where n is the number of mask classes. In the indoor environment, only humans are considered dynamic objects, so growing seed points need to be calculated only in the human mask area. The position of the growing seed point Pt is calculated as in (1), where $(\bar{x}, \bar{y})$ is the mean position of the pixels within the semantic mask region.

$(\bar{x}, \bar{y}) = \frac{1}{k} \sum_{j = 1}^{k} (x_{j}, y_{j})$ (1)

While the stack that stores seed points is not empty, the region grows from the seed pixel according to a depth criterion. The depth tolerance Th used in the growth rule is given by (2), where depthfactor is the scale factor of the depth map.

$Th = 0.03 \times depthfactor$ (2)
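As a small worked example of (1) and (2), the sketch below computes the seed as the mean position of the mask pixels and the depth tolerance from the scale factor. The function names are hypothetical, and rounding the mean to integer pixel coordinates is an assumption, since the paper does not say how a fractional mean is handled.

```python
import numpy as np


def seed_point(mask: np.ndarray) -> tuple[int, int]:
    """Mean (x, y) position of the non-zero mask pixels, as in (1).

    Rounding to integer pixel coordinates is an assumption of this sketch.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no pixels")
    return int(round(xs.mean())), int(round(ys.mean()))


def depth_tolerance(depth_factor: float) -> float:
    """Depth tolerance Th = 0.03 * depthfactor, as in (2)."""
    return 0.03 * depth_factor
```

With the TUM RGB-D depth scale factor of 5000, the tolerance is 150 raw depth units, i.e. 3 cm. Note that for a strongly non-convex mask the mean position can fall outside the mask, which a practical implementation would need to check.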

Algorithm 1 lists the specific steps of the region growth algorithm. As shown in Fig. 3, the optimized boundary contour is more refined than the original one.

Fig. 2

FC-HarDNet70 network structure.

Fig. 3

Comparison of boundary optimization.

Algorithm 1: Region Growth
Input: the depth image Depth, a seed pixel position Pt, a depth tolerance Th, and a Status variable for each pixel position
Output: Region, the set of pixels grown from the seed
Initialization: define Region, Neighborhood, and the stack V
Push Pt onto V
while V is not empty do
    Pop Pt from V
    foreach Qt ∈ Neighborhood(Pt) with Status(Qt) ≠ used do
        if abs(Depth(Qt) − Depth(Pt)) < Th then
            Set Qt = 255 and add Qt to Region
            Status(Qt) ← used
            Push Qt onto V
        end
    end
end
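The steps of Algorithm 1 can be sketched in Python as below. This is an illustrative version under stated assumptions rather than the authors' code: it uses a 4-connected neighborhood (the paper does not specify the neighborhood), takes the seed as an (x, y) pair, and returns the region as an 8-bit image with grown pixels set to 255.

```python
import numpy as np


def region_grow(depth: np.ndarray, seed: tuple[int, int], tol: float) -> np.ndarray:
    """Grow a region from `seed` = (x, y) over pixels of similar depth."""
    h, w = depth.shape
    region = np.zeros((h, w), dtype=np.uint8)
    used = np.zeros((h, w), dtype=bool)

    sx, sy = seed
    region[sy, sx] = 255
    used[sy, sx] = True
    stack = [(sx, sy)]

    while stack:
        px, py = stack.pop()
        # 4-connected neighborhood (an assumption of this sketch).
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            qx, qy = px + dx, py + dy
            if not (0 <= qx < w and 0 <= qy < h) or used[qy, qx]:
                continue
            # Cast to float so unsigned depth values cannot wrap around.
            if abs(float(depth[qy, qx]) - float(depth[py, px])) < tol:
                region[qy, qx] = 255
                used[qy, qx] = True
                stack.append((qx, qy))
    return region
```

As in Algorithm 1, a neighbor is compared with the pixel it is reached from rather than with the original seed, so the region can follow a smooth depth gradient while stopping at sharp depth edges.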

3.4 Multi-view geometry

Since semantic segmentation eliminates only the objects labeled as dynamic, we introduce a multi-view geometry method to detect potentially dynamic objects that are actually moving.

The five keyframes that have the highest coincidence with the current frame are selected as reference frames. The coincidence degree vs is determined by [25]:

$vs = 0.7 \times Δ d + 0.3 \times Δ r$ (3)

where vs is the coincidence between two frames, and Δd and Δr are the changes in position and attitude, respectively.

As shown in Fig. 4, the feature points x in the previous frames are projected onto the current frame to obtain the feature points x_cur and their projected depth Z_proj, and the corresponding 3-D points X are generated.

Fig. 4

Principle of multi-view geometry.

Depth information and parallax angle information are used to detect dynamic objects. The depth difference ΔZ = Z_proj − Z_cur compares the projected depth Z_proj with the actual depth value Z_cur measured at x_cur in the current depth image. The parallax angle α is the angle between the back-projections of x and x_cur. Algorithm 2 lists the steps of the multi-view geometry method.

Algorithm 2: Multi-View Geometry
Input: current frame F, depth image
Output: MaskList, the list of dynamic object masks
// Find the previous keyframes that are most similar to the current frame
RefFrames vF ← GetRefFrames(F)
// Find dynamic points
foreach keypoint x of vF do
    Compute the projected point x_cur and the projected depth Z_proj in F
    Compute the corresponding 3D point X
    Read the measured depth Z_cur at x_cur from the depth image
    if ΔZ = Z_proj − Z_cur > T_depth then
        mask = dynamic
        Add mask to MaskList
    end
    if α = ∠ x X x_cur > T_α then
        mask = dynamic
        Add mask to MaskList
    end
end
MaskList ← CombineMasks(F, mask)
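The per-keypoint tests in Algorithm 2 can be sketched as follows. This is a minimal sketch under several assumptions that are not stated in the paper: a pinhole camera with intrinsics `K`, depth values in metres with 0 marking missing depth, a known 4x4 transform `T_cur_ref` from the reference keyframe to the current frame, and caller-supplied thresholds `t_depth` and `t_alpha_deg` (the paper does not give the values of T_depth and T_α). The function name is hypothetical.

```python
import numpy as np


def is_dynamic_keypoint(x_ref, z_ref, K, T_cur_ref, depth_cur, t_depth, t_alpha_deg):
    """Apply the depth and parallax tests of Algorithm 2 to one keypoint.

    x_ref: (u, v) pixel of the keypoint in the reference keyframe.
    z_ref: depth of that keypoint in the reference keyframe (metres).
    Returns False when no decision is possible (point not visible, no depth).
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = x_ref
    # Back-project x to the 3-D point X in the reference camera frame.
    X_ref = np.array([(u - cx) * z_ref / fx, (v - cy) * z_ref / fy, z_ref])
    R, t = T_cur_ref[:3, :3], T_cur_ref[:3, 3]
    X_cur = R @ X_ref + t
    z_proj = X_cur[2]
    if z_proj <= 0:
        return False
    # Project X into the current frame to obtain x_cur.
    u_cur = int(round(fx * X_cur[0] / z_proj + cx))
    v_cur = int(round(fy * X_cur[1] / z_proj + cy))
    h, w = depth_cur.shape
    if not (0 <= u_cur < w and 0 <= v_cur < h):
        return False
    z_cur = float(depth_cur[v_cur, u_cur])
    if z_cur <= 0:
        return False
    # Parallax angle at X between the rays to the two camera centres.
    ray_ref = t - X_cur  # the reference camera centre sits at t in current coordinates
    ray_cur = -X_cur
    cos_alpha = ray_ref @ ray_cur / (np.linalg.norm(ray_ref) * np.linalg.norm(ray_cur))
    alpha = np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0)))
    return bool((z_proj - z_cur) > t_depth or alpha > t_alpha_deg)
```

A positive ΔZ means the measured surface is closer than the projected static point, which suggests that something has moved in front of it since the reference keyframe was captured.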

4 Experimental results

This section evaluates the ESD-SLAM system on the dynamic scene sequences of the TUM RGB-D dataset [27]. The experiments use two sets of dynamic sequences, sitting and walking. Each set contains four sequences, which are distinguished by the motion of the camera: (1) halfsphere: the camera moves along a hemisphere with a diameter of 1 m; (2) xyz: the camera moves along the x, y, and z axes; (3) rpy: the camera rotates about the roll, pitch, and yaw axes; and (4) static: the camera is stationary. The sitting sequences are low dynamic sequences, in which two people sit at a table and move only slightly. The walking sequences are high dynamic sequences, in which two people move extensively around the table. All experiments are conducted on a computer with an Intel Core i7-9700K CPU and a GeForce RTX 2080 Ti GPU, running Ubuntu 18.04.

4.1 Comparative experiment of extracting feature points

Fig. 5 compares the feature extraction of ORB-SLAM2 and ESD-SLAM on two sequences of the TUM dataset (walking_xyz and walking_halfsphere). The features extracted by ORB-SLAM2 are shown on the left, and those extracted by ESD-SLAM on the right. The ESD-SLAM algorithm clearly removes the feature points that belong to the moving pedestrians.

Fig. 5

The comparison of the ORB features extraction situation between ORB-SLAM2 and ESD-SLAM.

4.2 Qualitative evaluation on pose estimation

The proposed ESD-SLAM is developed on the basis of ORB-SLAM2, and ORB-SLAM3 is an improved successor of ORB-SLAM2 by the original authors. We compare the camera trajectories produced by the three systems on three sequences (walking_xyz, walking_halfsphere, and sitting_static) in the X-Y plane. Fig. 6 shows the trajectories estimated by ORB-SLAM2, ORB-SLAM3, and ESD-SLAM. The trajectories estimated by ESD-SLAM are shown in the left column, and those of the original ORB-SLAM2 and ORB-SLAM3 in the middle and right columns, respectively. The relative translation error is indicated by the length of the red lines. The trajectories estimated by the ORB-SLAM algorithms differ significantly from the actual trajectory, and their relative translation error is much larger than that of ESD-SLAM.

Fig. 6

Trajectory comparison of the ESD-SLAM and ORB-SLAM algorithms.

4.3 Quantitative evaluation on pose estimation

We compare the localization accuracy of ESD-SLAM with other advanced algorithms on six sequences of the dataset (walking_halfsphere, walking_rpy, walking_static, walking_xyz, sitting_halfsphere, and sitting_static). The absolute trajectory error (ATE) is used for quantitative evaluation; it reflects the gap between the estimated and the ground-truth trajectories. Its root mean square error (RMSE) measures the robustness and stability of the system. Each sequence is run five times to obtain five RMSEs, and then the median, mean, and minimum of the five RMSEs are calculated to reduce the impact of system non-determinism.
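The evaluation protocol above can be sketched in a few lines. The sketch assumes that the estimated and ground-truth trajectories have already been associated by timestamp and aligned to a common frame, which is what the TUM benchmark tools do before computing the ATE; those steps are not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def ate_rmse(est_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """RMSE of the per-pose translational error.

    Both inputs are (N, 3) arrays of positions that are assumed to be
    time-associated and already aligned.
    """
    errors = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))


def summarize_runs(rmses) -> dict:
    """Mean, median, and minimum of the RMSEs from repeated runs."""
    r = np.asarray(rmses, dtype=float)
    return {"mean": float(r.mean()), "median": float(np.median(r)), "min": float(r.min())}
```

Calling `summarize_runs` on the five per-run RMSEs of a sequence gives the three statistics reported for each system in Table 1.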

4.3.1 Comparison with ORB-SLAM2 and ORB-SLAM3

As shown in Tables 1 and 2, the proposed ESD-SLAM reduces the mean absolute trajectory error in all seven sequences, most markedly in the high dynamic ones. In high dynamic sequences (walking_halfsphere, walking_rpy, walking_static, and walking_xyz), ORB-SLAM2 cannot deal with the feature points of moving targets effectively, which leads to large localization errors. ORB-SLAM3 alleviates this problem only on walking_halfsphere. The proposed ESD-SLAM improves the performance on these four sequences by an order of magnitude, and the overall localization error is reduced by 94% to 98%. For the low dynamic sequences (sitting_halfsphere and sitting_static), ESD-SLAM yields a relatively small improvement in localization accuracy. This is because ESD-SLAM identifies people who only change their gestures as “moving objects” and therefore removes many feature points from the static parts of the person. For the purely static sequence (fr1_360), the localization accuracy of the three algorithms is basically the same.

Table 1
Comparisons of RMSE of ATE [m] for ESD-SLAM with ORB-SLAM2 and ORB-SLAM3

	ORB-SLAM2			ORB-SLAM3			ESD-SLAM
	Mean	Median	Min	Mean	Median	Min	Mean	Median	Min
w_half	0.5734	0.5639	0.4166	0.4375	0.4342	0.3383	0.0223	0.0227	0.0201
w_rpy	0.8362	0.8414	0.6922	0.8382	0.9123	0.6652	0.0305	0.0311	0.0280
w_static	0.3289	0.3570	0.2127	0.3750	0.3555	0.3365	0.0061	0.0061	0.0059
w_xyz	0.6603	0.6070	0.5147	0.7314	0.7224	0.6846	0.0146	0.0146	0.0142
s_half	0.0234	0.0227	0.0222	0.0352	0.0417	0.0225	0.0207	0.0210	0.0194
s_static	0.0085	0.0084	0.0080	0.0106	0.0103	0.0095	0.0057	0.0057	0.0052
fr1_360	0.1523	0.1550	0.1264	0.1530	0.1529	0.1258	0.1485	0.1529	0.1377

Table 2

Accuracy improvement of ESD-SLAM against ORB-SLAM2 and ORB-SLAM3

	Improvement of our approach against ORB-SLAM2			Improvement of our approach against ORB-SLAM3
	Mean	Median	Min	Mean	Median	Min
w_half	96.10%	96.00%	94.60%	94.90%	94.80%	94.00%
w_rpy	96.30%	96.30%	96.00%	96.30%	96.60%	95.80%
w_static	98.10%	98.30%	97.20%	98.40%	98.30%	98.20%
w_xyz	97.80%	97.60%	97.10%	98.00%	97.90%	97.90%
s_half	11.20%	7.60%	12.60%	41.10%	49.60%	13.70%
s_static	33.10%	31.70%	35.00%	46.20%	44.70%	45.20%
fr1_360	2.50%	5.61%	-8.93%	2.94%	0.00%	-9.45%
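The entries of Table 2 appear to be relative reductions of the RMSE with respect to the baseline. The paper does not state the formula, so the definition below is an assumption; it reproduces, for example, the 96.1% mean improvement on w_half from the corresponding Table 1 values. The function name is hypothetical.

```python
def improvement_percent(baseline_rmse: float, ours_rmse: float) -> float:
    """Relative RMSE reduction in percent (assumed definition).

    Positive values mean the proposed system has the lower error;
    negative values mean the baseline is better.
    """
    return (1.0 - ours_rmse / baseline_rmse) * 100.0


# Mean RMSE on w_half from Table 1: ORB-SLAM2 0.5734, ESD-SLAM 0.0223.
w_half_mean = improvement_percent(0.5734, 0.0223)
```

Some table entries may differ from this formula in the last digit, presumably because they were computed from unrounded RMSE values.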

4.3.2 Comparison with other dynamic SLAM systems

Four advanced SLAM systems for dynamic environments are compared with ESD-SLAM in Table 3. The data for the four existing systems are taken from the respective publications, and the data for ESD-SLAM are the mean of the five RMSEs. On the walking_halfsphere, walking_rpy, walking_xyz, and sitting_static sequences, ESD-SLAM is superior to the other systems in accuracy and robustness. On the walking_static and sitting_halfsphere sequences, ESD-SLAM is close to the best reported accuracy.

Table 3
Comparison of RMSE of ATE [m] between this system and other dynamic SLAM systems

	DynaSLAM	DS-SLAM	DOTMask	SaD-SLAM	ESD-SLAM
w_half	0.025	0.0303	0.04	0.0257	0.0223
w_rpy	0.035	0.4442	0.053	0.0318	0.0305
w_static	0.0060	0.0081	0.0080	0.0166	0.0061
w_xyz	0.0150	0.0247	0.0210	0.0167	0.0146
s_half	0.0170	–	–	0.0151	0.0207
s_static	–	0.0065	0.0060	0.0060	0.0057

4.4 Time execution

Real-time performance is an important factor when evaluating a SLAM system for practical applications. This experiment evaluates the tracking time of each system, and Table 4 shows the average tracking time. ESD-SLAM takes approximately 90–110 ms to track each frame, which is much faster than DynaSLAM. DS-SLAM takes less time to process each frame than ESD-SLAM, but its localization accuracy is not as good as that of ESD-SLAM.

Table 4
Real-time comparison of ESD-SLAM with other dynamic SLAM algorithms

Approach	Segmentation time (ms)	Tracking time per frame (ms)	GPU
DynaSLAM	195	>300	Tesla M40
DS-SLAM	59.4	>65	P4000
ESD-SLAM	71	90–110	RTX 2080 Ti

4.5 Dense 3-D mapping

In this section, ESD-SLAM filters out the dynamic objects in the environment and generates static dense point cloud maps without them. Three sequences of the TUM dataset (sitting_static, walking_static, and walking_xyz) are tested. As Fig. 7 shows, the left images are the maps built by ORB-SLAM2 without dynamic object elimination, which contain many ghosting artifacts, while the right images are the maps built by ESD-SLAM with dynamic object elimination, in which the dynamic objects are successfully removed.

Fig. 7

Dense 3-D mapping.
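The idea of building the static map from a masked depth image can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: dynamic pixels are invalidated by setting their depth to 0 (the TUM convention for missing depth), the camera follows a pinhole model with intrinsics `fx`, `fy`, `cx`, `cy`, and the points are expressed in the camera frame, so fusing several frames would additionally require the estimated camera poses. All function names are hypothetical.

```python
import numpy as np


def masked_depth(depth: np.ndarray, dynamic_mask: np.ndarray) -> np.ndarray:
    """Copy of `depth` with pixels inside the dynamic mask set to 0 (invalid)."""
    out = depth.copy()
    out[dynamic_mask > 0] = 0
    return out


def depth_to_points(depth, fx, fy, cx, cy, depth_factor):
    """Back-project all valid (non-zero) depth pixels to 3-D camera-frame points."""
    v, u = np.nonzero(depth)
    z = depth[v, u].astype(np.float64) / depth_factor
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Because masked pixels are treated exactly like missing depth, the back-projection step needs no special handling for dynamic objects.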

4.6 Evaluation in the real world

To demonstrate the effectiveness and robustness of our system, we collected experimental data from real scenes in an underground parking lot. The RGB images and the corresponding depth data are captured with an Intel RealSense D435i camera.

Fig. 8 compares the ORB features extracted by ORB-SLAM2 and ESD-SLAM in the real world. In Fig. 8(a), many of the features extracted by ORB-SLAM2 lie on the walking people, while in Fig. 8(b) the feature points on the moving people are largely removed by ESD-SLAM, and almost all feature points are extracted from the static background. Fig. 9 compares the camera trajectories estimated by ORB-SLAM2 and ESD-SLAM. The blue trajectory estimated by ESD-SLAM forms a closed loop that matches the camera motion, which qualitatively reflects its accuracy. In contrast, the dotted trajectory estimated by ORB-SLAM2 fails to form a closed loop because of the dynamic ORB features.

Fig. 8

Comparison of the ORB features extraction situation between ORB-SLAM2 and ESD-SLAM.

Fig. 9

Qualitative comparison of trajectory between ORB-SLAM2 and ESD-SLAM.

5 Conclusion

This paper presents an efficient semantic dynamic SLAM (ESD-SLAM) for indoor dynamic scenes based on deep learning, which uses the lightweight semantic segmentation network FcHarDNet to extract semantic information. The region growth algorithm is then used to refine the boundary after semantic segmentation. Combined with multi-view geometry, the dynamic features in the environment are eliminated, which improves the localization accuracy and stability of the system in dynamic environments. Finally, we carried out experiments on the public TUM RGB-D dataset and in a real-world environment. The results show that the proposed algorithm significantly improves localization accuracy and real-time performance in dynamic environments.

References

1. Cadena, Carlone, Carrillo, Latif, Scaramuzza, Neira, Reid and Leonard J.J., Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age, IEEE Trans Robot 32(6) (2016), 1309–1332.

2. Taketomi, Uchiyama and Ikeda, Visual SLAM algorithms: A survey from 2010 to 2016, IPSJ Trans Comput Vis Appl 9(1) (2017), 1–11.

3. Mur-Artal and Tardós J.D., ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras, IEEE Transactions on Robotics 33(5) (2017), 1255–1262.

4. Campos, Elvira, Rodríguez J.J.G., Montiel J.M.M. and Tardós J.D., ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM, Jul 2020. [Online]. Available: http://arxiv.org/abs/2007.11898

5. Qin, Li and Shen, VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator, IEEE Transactions on Robotics 34(4) (2018), 1004–1020.

6. Engel, Schöps and Cremers, LSD-SLAM: Large-scale direct monocular SLAM, in European Conference on Computer Vision (ECCV), 2014, pp. 834–849.

7. Fischler M.A. and Bolles R.C., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24(6) (1981), 381–395.

8. Cadena, Carlone, Carrillo, Latif, Scaramuzza, Neira, Reid and Leonard, Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age, IEEE Transactions on Robotics 32(6) (2016), 1309–1332.

9. Engel, Koltun and Cremers, Direct sparse odometry, IEEE Transactions on Pattern Analysis and Machine Intelligence 40(3) (2018), 611–625.

10. Saputra M.R.U., Markham and Trigoni, Visual SLAM and Structure from Motion in Dynamic Environments: A Survey, ACM Comput Surv 51(2), Article 37, 2018.

11. Panchpor A.A., Shue and Conrad J.M., A survey of methods for mobile robot localization and mapping in dynamic indoor environments, 2018 Conference on Signal Processing And Communication Engineering Systems (SPACES), IEEE, 2018.

12. Tan, Liu, Dong, Zhang and Bao, Robust monocular SLAM in dynamic environments, in 2013 IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2013, IEEE, Oct 2013, pp. 209–218.

13. Dai, Zhang, Li and Fang, RGB-D SLAM in Dynamic Environments Using Points Correlations, IEEE Robotics and Automation Letters 2(4) (2018), 2263–2270.

14. Li and Lee, RGB-D SLAM in Dynamic Environments Using Static Point Weighting, IEEE Robotics and Automation Letters 2(4) (2017), 2263–2270.

15. Sun, Liu and Meng M.Q., Improving RGB-D SLAM in dynamic environments: A motion removal approach, Robotics and Autonomous Systems 89 (2017), 110–122.

16. Kim D.H., Han S.B. and Kim J.H., Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment, Advances in Intelligent Systems and Computing 345, Springer Verlag, 2015, pp. 11–26.

17. Yu, et al., DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 1168–1174.

18. Bescos, Fácil J.M., Civera and Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes, IEEE Robotics and Automation Letters 3(4) (2018), 4076–4083. doi: 10.1109/LRA.2018.2860039

19. Vincent, Labbé, Lauzon J.-S., Grondin, Comtois-Rivet P.M. and Michaud, Dynamic Object Tracking and Masking for Visual SLAM, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 2020, pp. 4974–4979.

20. Yuan and Chen, SaD-SLAM: A Visual SLAM Based on Semantic and Depth Information, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 2020, pp. 4930–4935.

21. Badrinarayanan, Kendall and Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495.

22. He, Gkioxari, Dollár and Girshick, Mask R-CNN, 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 2980–2988.

23. Chao, Kao, Ruan, Huang and Lin, HarDNet: A Low Memory Traffic Network, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 3551–3560.

24. Huang, Liu, Van Der Maaten and Weinberger K.Q., Densely Connected Convolutional Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 2261–2269.

25. Tan, Liu, Dong, Zhang and Bao, Robust monocular SLAM in dynamic environments, in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2013, pp. 209–218.

26. Everingham, Van Gool, Williams C.K., Winn and Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision 88(2) (2010), 303–338.

27. Sturm, Engelhard, Endres, Burgard and Cremers, A benchmark for the evaluation of RGB-D SLAM systems, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 2012, pp. 573–580.
