Abstract
Visual Simultaneous Localization and Mapping (SLAM) is one of the key technologies for intelligent mobile robots. However, most existing SLAM algorithms have low localization accuracy in dynamic scenes. Therefore, a visual SLAM algorithm combining semantic segmentation and motion consistency detection is proposed. Firstly, the RGB images are segmented by the SegNet network, the prior semantic information is established, and the feature points of high-dynamic objects are removed. Secondly, motion consistency detection is carried out: the fundamental matrix is calculated by an improved Random Sample Consensus (RANSAC) algorithm, the abnormal feature points are output by the epipolar geometry method, and the feature points of low-dynamic objects are eliminated by combining the prior semantic information. Thirdly, the static feature points are used for pose estimation. Finally, the proposed algorithm is tested on the TUM dataset. Compared with ORB-SLAM2, it reduces the average RMSE by 93.99% in highly dynamic scenes, which shows that the algorithm can effectively improve the localization accuracy of the visual SLAM system in dynamic scenes.
Keywords
Introduction
Simultaneous Localization and Mapping (SLAM) describes a robot that, while moving through an unknown environment, estimates its own motion and builds a map of the environment from the data returned by the sensors it carries. When the sensor is a camera, the technique is called visual SLAM [1].
Nowadays, there are many mature solutions for visual SLAM, such as RGB-D SLAM [2] based on a depth camera, LSD-SLAM [3] based on the direct method, and ORB-SLAM2 [4, 5] based on the feature point method. ORB-SLAM2 is currently considered one of the best solutions based on the feature point method, and its relocalization mechanism helps it maintain high localization accuracy. However, most algorithms, including ORB-SLAM2, assume a static scene for tracking and localization and are therefore not suitable for dynamic scenes. When objects move in the scene, the camera motion may be estimated relative to the moving objects rather than the static background, which degrades the localization accuracy. Therefore, it is especially important to achieve accurate localization of visual SLAM in dynamic scenes.
This paper proposes a visual SLAM algorithm based on semantic segmentation and motion consistency detection. Combining the two methods enhances the system’s ability to understand the scene, effectively eliminates dynamic feature points, and leaves only static feature points for pose estimation, which improves the localization accuracy of the visual SLAM system in dynamic scenes. The rest of this paper is structured as follows: the second part reviews related work, the third part details the proposed system, the fourth part presents the experimental results, and the fifth part is the conclusion.
Related work
At present, the most commonly used approach is to remove the dynamic regions of the image in the front-end visual odometry and to use only the information of the static regions to calculate the transformation between poses [6]. Li et al. proposed a static weighting method, which uses a weight to represent the likelihood that a depth edge point is static and combines it with the intensity-assisted iterative closest point algorithm for inter-frame registration, reducing the impact of dynamic objects on localization accuracy [7]. Wei et al. applied a geometric constraint method to classify the sparse feature points, then combined color and depth information to segment the dynamic region and eliminated the feature points in that region [8]. Dai et al. used Delaunay triangulation [9] to connect the feature points into groups of triangles and compared the change of the triangle edges between adjacent frames. If an edge changed too much, it was considered to contain dynamic feature points, and those feature points were removed. Finally, the largest connected region of triangles was selected as the static region [10]. Yang et al. first used a geometric constraint method to roughly filter out the dynamic feature points, then applied a Random Sample Consensus (RANSAC) sampling method based on region division to calculate the fundamental matrix, and combined it with epipolar geometry to filter out dynamic feature points accurately [11]. Chen et al. exploited the fact that the optical flow direction of dynamic objects does not coincide with that of static objects, combined depth images to remove optical flow unrelated to camera motion, and found the optical flow threshold of moving objects by clustering to achieve high-accuracy detection of dynamic regions [12]. Lin et al. segmented the static region by means of a cumulative map of reprojection errors and optimized the dynamic and static regions with different weights to obtain accurate poses [13].
With the development of deep learning, combining semantic information can give better results. Bescos et al. proposed DynaSLAM, which combined multi-view geometry with the deep-learning-based Mask R-CNN [14] semantic segmentation network to remove dynamic feature points and used static feature points for tracking and localization [15]. The system has high localization accuracy in dynamic scenes, but its real-time performance is poor. DS-SLAM, proposed by Yu et al., distributed the epipolar geometry method and semantic segmentation to two threads, which improved the calculation speed [16]. However, its fundamental matrix is calculated from data containing too many outliers, which makes the localization of the system less accurate. Yang et al. first used YOLOv3 [17] to produce semantic bounding boxes, then removed the feature points inside the dynamic boxes to reduce the number of outliers, calculated a more stable fundamental matrix, and finally used epipolar geometry to determine the dynamic feature points [18]. However, this method does not perform pixel-level semantic segmentation, so a certain number of static feature points are also removed along with the feature points inside the boxes, which reduces the inliers available for the fundamental matrix. Ruan et al. used the Mask R-CNN network to perform pixel-level semantic segmentation of images, established prior semantic information of dynamic objects, and then combined it with epipolar geometry to remove dynamic feature points [19]. Ai et al. proposed the DDL-SLAM algorithm, which adds a Mask R-CNN based semantic segmentation network and multi-view geometry on top of ORB-SLAM2 to detect moving objects and uses static objects for pose estimation [20]. Han et al. proposed the PSPNet-SLAM system based on ORB-SLAM2, which employs optical flow and semantic segmentation to detect and eliminate dynamic points for accurate localization of visual SLAM in dynamic scenes [21]. However, these methods only consider people as dynamic objects, and dynamic objects other than people may still affect the localization accuracy of the system. Huang et al. first used the Fast-SCNN network to remove potential dynamic feature points and then added static points back using motion constraints with chi-square tests [22]. Wei et al. used the BiSeNet V2 network for semantic segmentation and combined it with dense optical flow for target detection to identify specific dynamic objects, reject dynamic regions and construct static maps [23]. Xu et al. improved the accuracy of the homography matrix and then combined it with the YOLOv3 network to reduce the effect of fuzzy noise and improve the localization accuracy of SLAM in dynamic scenes [24]. Hu et al. used the DeepLabv3+ semantic segmentation network to segment potentially dynamic objects, then used a multi-view geometric approach to detect object motion information, and finally used an ant colony strategy to reduce detection time and improve real-time performance [25]. Ran et al. used semantic segmentation for target detection, tracking and localization of multiple objects and for background reconstruction. Different moving object planes were solved independently for rigid body motion and were tracked and positioned independently, while non-planar moving objects were rejected as outliers and excluded from the 3D reconstruction [26].
To enable visual SLAM to obtain higher localization accuracy in dynamic scenes, a dynamic-scene visual SLAM algorithm combining semantic segmentation and motion consistency detection is proposed. It identifies dynamic objects, divides them into high-dynamic and low-dynamic objects, filters out the dynamic feature points on them in three passes, and uses only static feature points for pose estimation. The algorithm is mainly intended for indoor environments, so the RGB-D mode is adopted. The experimental results show that the algorithm effectively improves the localization accuracy of the visual SLAM system in dynamic scenes.
System introduction
System frameworks
The algorithm in this paper improves the RGB-D mode of the ORB-SLAM2 algorithm by adding a semantic segmentation thread and modifying the tracking thread. As shown in Fig. 1, there are four threads in total: tracking, local mapping, loop closure detection and semantic segmentation. The semantic segmentation thread and the tracking thread receive RGB images at the same time. The semantic segmentation thread performs pixel-level semantic segmentation on the RGB image, adds semantic labels to the segmentation results, and passes them to the tracking thread as prior semantic information. The tracking thread contains an added dynamic feature point filtering module that removes dynamic feature points and keeps only static feature points for pose estimation. The other two threads are the same as in ORB-SLAM2.

Fig. 1. System framework.
The SegNet deep learning network [27], implemented in the Caffe [28] framework, is used to perform pixel-level semantic segmentation of RGB images. The network is trained on the PASCAL VOC [29] dataset and can therefore segment 20 classes of objects. The input of the semantic segmentation thread is an RGB image, and the output is a segmented image with labels. Semantic recognition enhances the scene understanding ability of the visual SLAM system, and the output segmentation images are passed to the tracking thread as prior semantic information. For indoor environments, this paper takes the “Person” class label as the prior semantic information of high-dynamic objects, and the labels of other indoor objects, “Bottle”, “Chair”, “Table”, “Plant” and “Tv/monitor”, as the prior semantic information of low-dynamic objects. The prior semantic information is updated in real time, and each frame has its own independent prior semantic information: the segmentation image of the current frame is only used to judge the RGB image of the current frame. After the tracking thread has used the prior semantic information to remove the dynamic feature points of the current frame, that information becomes invalid and is overwritten by the prior semantic information of the next frame.
Dynamic feature points filtering method
Figure 2 is the flow chart of the dynamic feature point filtering module. The tracking thread extracts the ORB feature points, computes their descriptors, and then waits for the output of the semantic segmentation thread. When the tracking thread receives the semantic segmentation image, the prior semantic information is used to detect the feature points of dynamic objects.

Fig. 2. Dynamic feature point filtering module flow chart.
People are the objects most likely to move in an indoor environment. Because of their large size, a large number of feature points are extracted from people. However, ORB-SLAM2 assumes a static environment and cannot filter out a large number of dynamic features. At the same time, the bag-of-words model [30] used in ORB-SLAM2’s loop closure detection relies on feature point descriptors, so a change in a person’s spatial position in the scene may affect loop closure detection and cause the system to generate “false positives” [31]. These effects greatly reduce the localization accuracy of visual SLAM. Therefore, people are regarded as high-dynamic objects, and the feature points located on people are eliminated. The “Person” class label is used as the prior information of the high-dynamic object. After the tracking thread receives the semantic segmentation image, if there is a “Person” label, it first removes all the feature points inside the label, which constitutes the first dynamic feature point filtering.
However, in addition to high-dynamic objects such as people, there are also low-dynamic objects that may move, such as chairs and bottles. If a certain number of feature points are extracted on such objects after they have moved, the localization accuracy of visual SLAM is affected. Therefore, motion consistency detection is performed on the remaining feature points, and the abnormal feature points are output and put into the point set P_out. After the motion consistency detection module outputs the abnormal feature points, the tracking thread reads the result of the semantic segmentation thread again, and “Bottle”, “Chair”, “Table”, “Plant” and “Tv/monitor” are regarded as low-dynamic objects and set as the labels to be detected, which serve as the prior semantic information of low-dynamic objects. The abnormal feature points in P_out are then examined. When a certain number of abnormal feature points fall within the semantic label of a low-dynamic object class, all the feature points in that labeled region are eliminated, which constitutes the second dynamic feature point filtering.
The PASCAL VOC dataset used for training contains only 20 classes of objects, so there may be other moving objects that are not recognized and for which no prior semantic information of low-dynamic objects is available. Therefore, after the second filtering of dynamic feature points, all abnormal feature points remaining in the point set P_out are regarded as dynamic feature points and eliminated. This is the third filtering of dynamic feature points, and the remaining stable and reliable static feature points are used for pose estimation.
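The three filtering passes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function name `filter_dynamic_points`, the per-keypoint `(class, region_id)` representation, and the threshold `min_abnormal` (the text only says "a certain number") are hypothetical, and the abnormal indices are taken as given from motion consistency detection. The text does not say whether a "class of area" is one connected region or the whole class; the sketch treats each region separately, which matches the unmoved chair in Fig. 5 keeping its feature points.

```python
from collections import Counter

# Class names follow the labels quoted in the text.
HIGH_DYNAMIC = {"Person"}
LOW_DYNAMIC = {"Bottle", "Chair", "Table", "Plant", "Tv/monitor"}


def filter_dynamic_points(keypoints, regions, abnormal, min_abnormal=3):
    """Return the keypoints kept as static after the three filtering passes.

    keypoints    -- list of (u, v) pixel coordinates
    regions      -- (class_name, region_id) of the pixel under each keypoint
                    (region_id is a hypothetical per-object identifier)
    abnormal     -- set of keypoint indices flagged by motion consistency detection
    min_abnormal -- hypothetical value for "a certain number" of abnormal points
    """
    # Pass 1: remove every feature point on a high-dynamic object.
    kept = [i for i, (cls, _) in enumerate(regions) if cls not in HIGH_DYNAMIC]

    # Pass 2: a low-dynamic region holding enough abnormal points is treated
    # as moving, and all feature points inside that region are removed.
    hits = Counter(regions[i] for i in kept
                   if i in abnormal and regions[i][0] in LOW_DYNAMIC)
    moving = {reg for reg, n in hits.items() if n >= min_abnormal}
    kept = [i for i in kept if regions[i] not in moving]

    # Pass 3: remove the remaining abnormal points, whatever their label.
    kept = [i for i in kept if i not in abnormal]
    return [keypoints[i] for i in kept]
```

Pass 3 is what covers moving objects outside the 20 trained classes: their points carry no useful label, but they are still dropped if they were flagged as abnormal.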
The motion consistency detection is performed on the remaining feature points, and the abnormal feature points are output by the method of epipolar geometry based on the assumption of epipolar constraints.
Figure 3 shows the epipolar geometric relationship between feature points on two adjacent frames. O1 and O2 are the optical centers of the camera in the two poses, and P (x, y, z) is the observed point in three-dimensional space. p1 (u1, v1) and p2 (u2, v2) are the projections of point P on plane I1 and plane I2; they are the matched feature points, given in two-dimensional pixel coordinates. The three points O1, P and O2 define the epipolar plane E_p, which intersects the plane I1 in the epipolar line l2 and the plane I2 in the epipolar line l1. The line segment O1O2 intersects the planes I1 and I2 at the points e1 and e2, which are the epipoles.

Fig. 3. Schematic diagram of epipolar geometry.
Assuming that p1 and p2 are a correct feature match, let the normalized coordinates of p1 and p2 be P1 = (u1, v1, 1)^T and P2 = (u2, v2, 1)^T. Then Equation (1) is satisfied:
$$ d_1 P_1 = K P, \qquad d_2 P_2 = K (R P + t) \tag{1} $$
where d1 and d2 are the depth values of the feature points p1 and p2, K is the camera intrinsic matrix, and R and t are the rotation and translation of the camera as it moves from O1 to O2.
Let x1 = K^{-1}P1 and x2 = K^{-1}P2; then we get
$$ x_2 = C_0 R x_1 + C_1 t \tag{2} $$
where C0 and C1 are constants. Left-multiplying both sides of Equation (2) by x2^T [t∧] gives
$$ x_2^T [t^\wedge] x_2 = C_0\, x_2^T [t^\wedge] R x_1 \tag{3} $$
where [t∧] is the antisymmetric matrix corresponding to the translation vector t, and the term in C1 vanishes because [t∧] t = 0. The left side of Equation (3) is equal to zero, so Equation (4) follows:
$$ x_2^T [t^\wedge] R x_1 = 0 \tag{4} $$
Substituting x1 = K^{-1}P1 and x2 = K^{-1}P2 into Equation (4) gives Equation (5):
$$ P_2^T K^{-T} [t^\wedge] R K^{-1} P_1 = P_2^T F P_1 = 0 \tag{5} $$
Equation (5) is the epipolar constraint between the normalized coordinates P1 and P2 of the pixels p1 and p2 on the planes I1 and I2. It indicates that P1 and P2 lie on the plane O1PO2, where F is the fundamental matrix. The vector
However, because of errors in practical applications, the normalized coordinate point P2 does not necessarily fall exactly on the epipolar line l1, although the distance D from P2 to l1 should be close to zero. When D is too large, there are two possible causes: either p1 and p2 are incorrectly matched, or point P has moved. In the case where P moves, as shown in Fig. 3, when the camera moves from O1 to O2, P moves to P′, the projection at the plane I2 is
The epipolar line l1 corresponding to the normalized coordinate point P1 lies on the plane I2 and satisfies Equation (6):
$$ l_1 = (a, b, c)^T = F P_1 \tag{6} $$
where (a, b, c)^T is the coefficient vector of the epipolar line l1 and F is the fundamental matrix. The distance from the normalized coordinate P2 to the epipolar line l1 can be expressed as
$$ D = \frac{|P_2^T F P_1|}{\sqrt{a^2 + b^2}} \tag{7} $$
A threshold ξ0 is set. When D > ξ0, p2 is regarded as an abnormal feature point and put into the point set P_out.
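The distance test can be illustrated with a short NumPy sketch. It assumes the usual point-to-line form, D = |P2^T F P1| / sqrt(a^2 + b^2) with (a, b, c)^T = F P1, and the function names `epipolar_distance` and `flag_abnormal` are hypothetical. A line with a = b = 0 is degenerate and is not handled here.

```python
import numpy as np


def epipolar_distance(F, p1, p2):
    """Distance from p2 (current frame) to the epipolar line of p1."""
    P1 = np.array([p1[0], p1[1], 1.0])
    P2 = np.array([p2[0], p2[1], 1.0])
    line = F @ P1                          # epipolar line l1 = (a, b, c)^T
    return abs(P2 @ line) / np.hypot(line[0], line[1])


def flag_abnormal(F, matches, xi0):
    """Indices of matched pairs (p1, p2) whose distance D exceeds xi0."""
    return [i for i, (p1, p2) in enumerate(matches)
            if epipolar_distance(F, p1, p2) > xi0]
```

For instance, with the toy matrix `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]` (a pure sideways translation with identity intrinsics), every epipolar line is horizontal, so a match that keeps its row gives D = 0 and a match displaced by three rows gives D = 3.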
The fundamental matrix F can represent the pose change relationship of two adjacent frames of images, and is the key information for judging dynamic feature points by using epipolar geometry. Let F2 be the current frame image, F1 be the previous frame image, and their feature point sets are
However, the matching point set M_p may contain a certain number of mismatches, as well as matched pairs of dynamic feature points on objects other than people. With too many outliers, the fundamental matrix F solved by the traditional RANSAC [32] algorithm has a large error, which affects the subsequent detection of dynamic feature points on objects other than people. Solving a stable fundamental matrix is therefore the key to detecting abnormal feature points.
The fundamental matrix is calculated with the improved RANSAC algorithm. Two main improvements are made to the traditional RANSAC algorithm. The first is to filter the initial data to remove incorrect matches and matched pairs that may have large errors. The second is to use two rounds of iteration with an additional threshold test: after the first round, the static feature point assumption is applied to screen the data again, the more accurate matched pairs are selected as new input data, and these better inliers are used in the second round to solve a more stable fundamental matrix.
First, the matching point pairs in the matching point set M_p of the current frame F2 and the previous frame F1 are screened. A threshold H0 is set, and a matching point pair is eliminated when its Hamming distance is larger than H0. At the same time, to reduce the influence of pixel-level semantic segmentation errors, all feature points in F2 are traversed, and when a “Person” semantic label appears within 3 pixels of a feature point, the matching point pair corresponding to this feature point is removed. The filtered matching point set M_pre = {(p1, p2)_1, (p1, p2)_2, (p1, p2)_3, ⋯, (p1, p2)_a} is then obtained, where a is the total number of matched point pairs after filtering. Outliers have thus been preliminarily screened out of the point set, so more reliable data is used for iteration. The specific iterative process is as follows:
Step 1) Eight point pairs in the matching point set M_pre are selected randomly, and the selected data is used to estimate the model M0.
Step 2) The estimated model M0 is used to check all the point pairs in the matching point set M_pre. The point pairs that satisfy the model M0 are saved, and their number n is recorded.
Step 3) Step 1) and Step 2) are repeated. When the required number of iterations is reached, the set with the largest number of point pairs is selected, the model parameters are output, and the initial value F0 of the fundamental matrix is solved.
Step 4) The initial value F0 of the fundamental matrix is substituted into Equations (5) and (6), the distance threshold d0 of the static feature point assumption is set, and all the point pairs in the matching point set M_pre are traversed. When D < d0, the pair (p1, p2)_i is put into the new point set M_cur = {(p1, p2)_1, (p1, p2)_2, (p1, p2)_3, ⋯, (p1, p2)_b}, where b is the total number of new matching point pairs.
Step 5) A second round of iteration is performed with the better inliers: eight point pairs in the new point set M_cur are selected randomly, and the model M0 is estimated again.
Step 6) The model M0 is used to check all the point pairs in the new point set M_cur. The data satisfying the model M0 is saved, and the number of point pairs n is recorded.
Step 7) Step 5) and Step 6) are repeated. When the required number of iterations is reached, the model parameters with the largest number of point pairs are output, and the final stable fundamental matrix F is obtained.
Finally, the stable fundamental matrix F is substituted into Equations (6) and (7). Equation (7) is used to check all matching point pairs in the matching point set M_p, and the abnormal feature points are output.
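The pre-filtering and the two rounds of iteration can be sketched with NumPy as follows. This is a minimal illustration under several assumptions, not the paper's implementation: the model estimator is taken to be the normalized eight-point algorithm (the text only says that eight point pairs are sampled), "satisfying the model" is interpreted as an epipolar distance below `thresh`, "within 3 pixels" is read as a square window, and every function name, the iteration count and the threshold defaults are hypothetical.

```python
import numpy as np


def prefilter(x1, x2, hamming, person_mask, h0, radius=3):
    """Drop pairs whose Hamming distance exceeds h0 or whose point in the
    current frame has a "Person" pixel within `radius` pixels."""
    h, w = person_mask.shape
    keep = []
    for (u, v), dist in zip(np.rint(x2).astype(int), hamming):
        r0, r1 = np.clip([v - radius, v + radius + 1], 0, h)
        c0, c1 = np.clip([u - radius, u + radius + 1], 0, w)
        keep.append(dist <= h0 and not person_mask[r0:r1, c0:c1].any())
    keep = np.array(keep, dtype=bool)
    return x1[keep], x2[keep]


def eight_point(x1, x2):
    """Normalized eight-point estimate of F such that P2^T F P1 = 0."""
    def normalize(pts):
        mean = pts.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[s, 0, -s * mean[0]], [0, s, -s * mean[1]], [0, 0, 1]])
        return np.column_stack([pts, np.ones(len(pts))]) @ T.T, T
    n1, T1 = normalize(x1)
    n2, T2 = normalize(x2)
    # Each row encodes the constraint P2^T F P1 = 0 for one point pair.
    A = np.column_stack([n2[:, [0]] * n1, n2[:, [1]] * n1, n1])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt      # enforce rank 2
    return T2.T @ F @ T1                         # undo the normalization


def epipolar_distances(F, x1, x2):
    """Distance from each point in x2 to the epipolar line of its match."""
    P1 = np.column_stack([x1, np.ones(len(x1))])
    P2 = np.column_stack([x2, np.ones(len(x2))])
    lines = P1 @ F.T                             # row i is F @ P1_i
    num = np.abs(np.sum(P2 * lines, axis=1))
    return num / np.maximum(np.hypot(lines[:, 0], lines[:, 1]), 1e-12)


def ransac(x1, x2, iters, thresh, rng):
    """One round: keep the model supported by the most point pairs."""
    if len(x1) < 8:
        raise ValueError("at least eight point pairs are required")
    best_F, best_n = None, -1
    for _ in range(iters):
        idx = rng.choice(len(x1), 8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        n = int(np.sum(epipolar_distances(F, x1, x2) < thresh))
        if n > best_n:
            best_F, best_n = F, n
    return best_F


def two_pass_fundamental(x1, x2, iters=200, thresh=1.0, d0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    F0 = ransac(x1, x2, iters, thresh, rng)            # Steps 1-3
    keep = epipolar_distances(F0, x1, x2) < d0         # Step 4
    if keep.sum() < 8:                                 # too few pairs left
        return F0
    return ransac(x1[keep], x2[keep], iters, thresh, rng)  # Steps 5-7
```

Here `x1` and `x2` are N x 2 arrays of matched pixel coordinates in the previous and current frame. The second round samples only from pairs that already agree with F0, so a sample is much less likely to contain an outlier than in the first round.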
Experimental result and analysis
The experiments were carried out on a computer with an Intel i5-10300H CPU @ 2.50 GHz, 16 GB of memory, an NVIDIA GeForce GTX 1650 Ti graphics card, and the Ubuntu 18.04 operating system. To simulate dynamic scenes in indoor environments, the dynamic object sequences of the TUM dataset [33], which consist of the “walking” sequences and the “sitting” sequences, were selected as the RGB-D input. The TUM data were collected in a real environment with a depth camera. Each sequence consists of color images with corresponding depth maps and also contains the ground-truth trajectory of the camera. To verify the effectiveness and localization accuracy of the proposed algorithm, a dynamic feature point elimination experiment and a SLAM system performance test were carried out.
Dynamic feature elimination experiment
This experiment is used to test the ability of the proposed algorithm to eliminate feature points of dynamic objects, including high-dynamic objects such as “people” and low-dynamic objects such as “Chair”. The dynamic object sequence “walking_xyz” from the TUM dataset was used as the image input to compare the effectiveness of ORB-SLAM2 and this paper’s algorithm to eliminate dynamic feature points.
Figure 4 shows the removal of dynamic feature points on people. The corresponding images are the 386th and 571st frames of the sequence, in both of which the persons are walking. The ORB-SLAM2 algorithm extracted a certain number of feature points on the persons. The algorithm in this paper removed the feature points on the persons after semantic segmentation and motion consistency detection.

Fig. 4. People dynamic feature point removal effect.
Figure 5 shows the removal of dynamic feature points on people and other objects. The corresponding images are the 600th and 642nd frames of the sequence. In these two images a person moves a chair and sits down, so both the person and the chair on the right side of the picture have moved. ORB-SLAM2 extracted feature points from the person and the moving chair. The algorithm in this paper also considers the influence of objects other than people, so, as shown in Fig. 5(b), the feature points on the person and the moved chair were eliminated, while the feature points on the chair on the left side of the image, which did not move, were preserved.

Fig. 5. The culling effect of dynamic feature points of people and other objects.
To verify the localization accuracy of the proposed algorithm in dynamic scenes, a SLAM system performance test was conducted, in which the algorithm in this paper is compared with ORB-SLAM2 and with DS-SLAM. ORB-SLAM2 is considered to be one of the best and most stable SLAM solutions based on the feature point method. DS-SLAM is one of the best-performing visual SLAM solutions for dynamic environments, and it is also built on ORB-SLAM2.
The Absolute Trajectory Error (ATE) and the Relative Pose Error (RPE) were used as the evaluation criteria for the experiments. The ATE is the difference between the ground-truth pose and the estimated pose at the same timestamp, which most directly reflects the accuracy of the SLAM pose estimate and of the estimated trajectory. The RPE is the difference between the pose changes of the ground-truth trajectory and the estimated trajectory per unit time, which reflects the rotational drift and translational drift of the SLAM pose estimate. The Root Mean Square Error (RMSE), mean error, median error, and standard deviation (std) were used as further detailed evaluation criteria.
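As a rough illustration of how the four ATE statistics relate to each other, the following NumPy sketch computes them from two position sequences. It assumes the trajectories have already been associated by timestamp and aligned to the same coordinate frame (evaluation tools normally perform this alignment first), and it covers only the translational part of the pose. The function name `ate_statistics` is hypothetical.

```python
import numpy as np


def ate_statistics(gt_xyz, est_xyz):
    """Translational ATE statistics for two N x 3 position arrays whose rows
    share timestamps and are expressed in the same frame."""
    gt = np.asarray(gt_xyz, dtype=float)
    est = np.asarray(est_xyz, dtype=float)
    err = np.linalg.norm(gt - est, axis=1)     # per-timestamp position error
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mean": float(np.mean(err)),
        "median": float(np.median(err)),
        "std": float(np.std(err)),
    }
```

Because the RMSE squares each error before averaging, it is more sensitive to occasional large deviations than the mean or the median, which is why it is the headline figure in dynamic sequences.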
Equation (8) is used to calculate the improvement value, where Q is the improvement value, φ is the evaluation result of the algorithm being compared, and θ is the evaluation result of the algorithm in this paper:
$$ Q = \frac{\varphi - \theta}{\varphi} \times 100\% \tag{8} $$
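A worked example may make the improvement value concrete. The sketch below assumes that it is the usual relative reduction, Q = (φ − θ) / φ × 100%, and the numbers in the usage note are made up for illustration, not taken from the tables.

```python
def improvement(phi, theta):
    """Relative reduction, in percent, of the proposed algorithm's error
    (theta) with respect to the compared algorithm's error (phi)."""
    if phi == 0:
        raise ValueError("the compared error must be non-zero")
    return (phi - theta) / phi * 100.0
```

With a hypothetical baseline error of 2.0 and a proposed-algorithm error of 0.5, `improvement(2.0, 0.5)` gives 75.0, that is, a 75% reduction. A negative value would mean that the proposed algorithm performed worse than the baseline.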
The experiment used the highly dynamic “walking” sequences and part of the low dynamic “sitting” sequences from the dynamic object sequences. The “walking” group contains four sub-sequences: walking_xyz (w_xyz), walking_halfsphere (w_half), walking_rpy (w_rpy) and walking_static (w_stat). In the “xyz” sequence the camera translates along the x, y and z axes, in the “halfsphere” sequence it moves along a hemispherical trajectory, in the “rpy” sequence it rotates about the x, y and z axes, and in the “static” sequence it remains essentially stationary. In these four sub-sequences, two persons move frequently in the scene and move some objects, so the scenes are rich in dynamic information. From the “sitting” group, the sub-sequences sitting_halfsphere (s_half) and sitting_static (s_stat) were selected. In these two sub-sequences, two persons talk while sitting on chairs, with only a small amount of body movement and little dynamic information.
Tables 1 to 3 show the results of the comparison tests. In the highly dynamic sequences, the ATE and translational drift of the algorithm in this paper are lower than those of ORB-SLAM2 and slightly lower than those of DS-SLAM. Its rotational drift is lower than that of ORB-SLAM2; compared with DS-SLAM, the values for the other three sequences are slightly higher, the exception being the “w_half” sequence. In the low dynamic sequences, the ATE, translational drift and rotational drift of the algorithm in this paper differ very little from those of ORB-SLAM2 and DS-SLAM.
Table 1. Test results of absolute trajectory error (ATE)
Table 2. Test results of translational drift (RPE)
Table 3. Test results of rotational drift (RPE)
Table 4 shows the improvement values of our algorithm relative to ORB-SLAM2 and DS-SLAM, where RPE(T) represents translational drift and RPE(R) represents rotational drift. For ATE in the highly dynamic sequences, the average RMSE of our algorithm is reduced by 93.99% and the average std by 93.2% compared with ORB-SLAM2. Compared with DS-SLAM, the average RMSE is reduced by 25.58% and the average std by 20.88%, with a large improvement on the “w_rpy” sequence. In the low dynamic sequences, the average RMSE of our algorithm is reduced by 14.6% and the average std by 17.38% compared with ORB-SLAM2, and the average RMSE is reduced by 6.22% and the average std by 1.34% compared with DS-SLAM. Because the people in the low dynamic sequences move little and the scenes are close to static, the improvement over ORB-SLAM2 and DS-SLAM is smaller there. For RPE, the improvement values follow the same pattern as for ATE. In summary, the proposed algorithm improves localization accuracy, reduces drift and makes the system more stable, most clearly in highly dynamic scenes.
Table 4. Improvement value
The improvement is also shown graphically. Figure 6 shows the ATE plots of ORB-SLAM2, DS-SLAM and the algorithm in this paper. The black line is the ground-truth trajectory of the camera, the blue line is the estimated trajectory, and the red line represents the difference between the two. In the highly dynamic sequences, the red area is greatly reduced compared with ORB-SLAM2, so the localization accuracy of the proposed algorithm is significantly improved, and it is slightly higher than that of DS-SLAM. On the “w_rpy” sequence, the algorithm in this paper improves considerably on DS-SLAM. In the low dynamic sequences, the difference in localization accuracy among the three algorithms is small. The algorithm in this paper uses prior semantic information in motion consistency detection, which reduces the influence of outliers; the improved RANSAC algorithm calculates a more stable fundamental matrix, so that epipolar geometry can detect dynamic feature points more accurately; and the influence of dynamic objects other than people is also considered, which makes the rejection of dynamic feature points more comprehensive. As a result, the algorithm in this paper provides some improvement in localization accuracy over DS-SLAM in highly dynamic sequences.

Fig. 6. Absolute trajectory error.
Compared with the other sequences, the localization accuracy on the “w_rpy” sequence is lower. Its motion is mainly rotation, and the rotation is fast enough to blur some of the images. At the same time, the feature point descriptor adapts poorly to rotation, so a large number of stable feature matches cannot be formed, which reduces the localization accuracy.
Conclusion
A visual SLAM algorithm based on semantic segmentation and motion consistency detection is proposed to solve the problem of low localization accuracy in dynamic scenes. The algorithm filters out dynamic feature points in three passes. First, the SegNet network performs pixel-level semantic segmentation of the RGB images, and the segmentation results are used as prior semantic information. People are defined as high-dynamic objects, the “Person” label is detected, and the feature points located on people are removed, which is the first filtering of dynamic feature points. Then, the abnormal feature points are output by motion consistency detection, and the feature points on low-dynamic objects other than people are eliminated according to the distribution of the abnormal feature points combined with the prior semantic information of low-dynamic objects, which is the second filtering. Finally, all remaining abnormal feature points are regarded as dynamic feature points and eliminated, which is the third filtering. The algorithm is tested on highly dynamic and low dynamic scenes from the TUM dataset. In highly dynamic scenes, the average RMSE is reduced by 93.99% compared with ORB-SLAM2 and by 25.58% compared with DS-SLAM. The improvement in low dynamic scenes is smaller: the average RMSE is 14.6% lower than that of ORB-SLAM2 and 6.22% lower than that of DS-SLAM.
The SLAM algorithm proposed in this paper uses the RGB-D mode. In subsequent work, support for monocular and stereo modes is planned. It is also planned to apply the algorithm to real robots to enable real-time map building in different environments.
