3D registration based on V-SLAM and application in augmented reality

Abstract

Augmented reality is currently a research hotspot in computer vision and computer graphics, and its applications are becoming increasingly widespread. One of its key technologies is the three-dimensional (3D) registration of virtual objects, which requires accurate camera pose estimation and 3D reconstruction of the scene. This paper therefore studies 3D registration based on visual SLAM and makes three main contributions. (1) A dense matching method based on a depth camera is proposed, which is well suited to applications such as augmented reality in which the camera moves fast or rotates strongly. (2) A dense reconstruction method based on Voxel Hashing is designed, which alleviates the low computational efficiency and low precision of existing RGB-D SLAM methods. (3) A simple augmented reality system is designed to verify the registration and fusion of virtual objects. Experiments show that, compared with state-of-the-art methods, the proposed algorithm generates a more refined and smoother reconstructed model, and that the virtual-real fusion based on this model is also more realistic.

Keywords

V-SLAM RGB-D 3D reconstruction augmented reality

1. Introduction

With the increasing popularity of mobile and wearable devices, augmented reality has received unprecedented attention in recent years and has become a research hotspot in computer vision and computer graphics. One of the difficulties of augmented reality is achieving stable and accurate 3D registration of virtual objects [1]. The prerequisite for solving this problem is accurate camera pose estimation and real-time 3D reconstruction, which is precisely the problem that Simultaneous Localization and Mapping (SLAM) systems address. SLAM refers to constructing a digital model of the environment while moving and, at the same time, estimating one's own motion with specific sensors, without prior knowledge of the environment [2].

The sensors of a SLAM system are generally LIDAR, an inertial measurement unit (IMU) and a camera; SLAM with the camera as the main data acquisition device is called visual SLAM (V-SLAM). Traditional V-SLAM can be divided into monocular, binocular and depth camera (RGB-D) SLAM according to the camera type and working method. A V-SLAM system must solve two core problems: camera localization and map construction. Depth cameras such as the Microsoft Kinect V2 have become affordable commodities in recent years, and they can directly measure the distance between an object and the camera (i.e., the depth) by means of infrared structured light or Time-of-Flight. Compared with binocular cameras, which must compute object depth algorithmically, depth cameras save considerable computing resources [3]. Therefore, research on and applications of SLAM systems based on RGB-D cameras have received more and more attention, and many research results have been published (such as KinectFusion, Kintinuous, ElasticFusion and DVO-SLAM). However, the existing algorithms fall short in robustness and accuracy when the camera moves rapidly or rotates violently, or when the scene is poorly textured.

This paper mainly studies the technology of camera localization and real-time 3D reconstruction based on V-SLAM, and proposes a tracking method based on RGB-D dense matching (Direct Method), which includes pose estimation and loop closure. Then, based on the Voxel Hashing technology [46], the dense 3D model is reconstructed and applied to the augmented reality system to realize the 3D registration. Compared with traditional algorithms, the algorithm proposed can effectively improve the accuracy of 3D registration and achieve a better fusion effect.

The main contributions of this paper are as follows.

(1) A dense matching method based on a depth camera is proposed, which can be well applied to scenes with fast camera movement or severe rotation, such as augmented reality.
(2) A dense reconstruction method based on Voxel Hashing is designed, which alleviates the low computational efficiency and low precision of the existing RGB-D SLAM reconstruction methods.
(3) A simple augmented reality system is designed to verify the registration and fusion of virtual objects in dense scenes. Experiments show that the algorithm proposed in this paper can achieve a more realistic fusion effect.

2. Related works

V-SLAM and augmented reality have been deeply studied in their respective fields. In recent years, improvements in hardware and algorithms have gradually raised the accuracy of camera pose estimation and the quality of real-time 3D reconstruction, both of which provide better technical support for 3D registration. Recently, research in the two fields has shown a trend of crossing over and driving each other. This section reviews representative results in the two fields and discusses the pros and cons of each.

2.1 Visual SLAM

Visual SLAM uses a camera as its sensor, which has the advantages of small size, low cost and rich information. The classical framework of such a system is shown in Fig. 1.

Figure 1.

V-SLAM structure.

The function of the visual odometry is to estimate the relative pose transformation of the camera using two frames. If the transformation matrix can be obtained, the correct tracking of the camera can be realized. From a global perspective, since the visual odometry will accumulate errors during the tracking process, optimization is required for error suppression, and finally the map is updated with the estimated camera pose. During the tracking process, the camera may move violently and cause the tracking to be lost, so relocalization is generally required. In the optimization process, loop closure detection will be performed to reduce the amount of calculations and improve the optimization effect.

Early V-SLAM mostly adopted a filter-based approach, represented by MonoSLAM [2] and MSCKF [4]. In MonoSLAM, the motion parameters of the camera and the 3D positions of the landmarks are jointly expressed as a probabilistic state, and for each frame an EKF (Extended Kalman Filter) is used to predict and update the joint state. The biggest disadvantage of MonoSLAM is its $O(N^{3})$ running time ($N$ is the number of landmarks), which limits its scalability, so that it can only be applied to a relatively small space containing a small number of landmarks (about a few hundred). In addition, the use of the EKF can easily cause errors to accumulate quickly. To address error accumulation, MSCKF introduces IMU (Inertial Measurement Unit) data as input and uses a sliding window to reduce error. It discards the constraints between the camera and the map points during computation and only considers the camera position, which effectively reduces the amount of calculation. However, it only discusses the monocular case.

Because of noise, the motion and observation equations cannot hold exactly. Therefore, instead of assuming that the data must conform to the equations, it is better to optimize over the noisy data to obtain a relatively accurate state estimate. With the development of nonlinear optimization technology, more and more V-SLAM systems have begun to adopt iterative optimization as the back-end. Visual odometry can be divided into sparse and dense methods according to how many pixels are used.

The typical representatives of the sparse method are the PTAM and ORB-SLAM series. PTAM [5] decouples the classic architecture of V-SLAM. Specifically, it separates tracking and mapping and assigns them to different threads so that they can be executed in parallel and the system can run in real time. PTAM is the first to use Bundle Adjustment to optimize both the pose and the map. Although the system is divided into two threads, the back-end optimization takes longer to run than the front-end. Researchers have since made successive improvements on PTAM. For example, in 2015, Mur-Artal et al. proposed ORB-SLAM [6], which (1) uses ORB features, which are faster to extract and rotation invariant; (2) optimizes the global pose by constructing a pose graph; and (3) improves the key frame selection scheme and map updating method, deleting redundant key frames and map points so that pose estimation stays accurate and the number of optimization parameters does not grow too fast.

The principle of feature point matching is to estimate the pose between frames by computing and minimizing the reprojection error, but it suffers from problems such as high time cost and mismatching. The method is also sensitive to image texture and to rapid camera movement, so some scholars have proposed using all pixels in the image to help estimate the pose, that is, the direct method.

The direct method estimates the pose by minimizing the per-pixel photometric error between two frames, without extracting and matching feature points, so there is no mismatching problem and no strict constraint on image texture. DTAM [7], proposed by Newcombe et al. in 2011, used a monocular camera to realize dense visual odometry and reconstruct the surface of the scene, but it requires parallel computing on a GPU. LSD-SLAM [8], proposed by Engel et al. in 2014, does not require the computing power of a GPU. It minimizes the error only in image regions with large gradients, making the algorithm portable to mobile platforms. Forster et al. proposed SVO [9] in 2016, which extracts only the key points in the image, omits the computation of feature descriptors, and uses the direct method to match the key points, which significantly accelerates the calculation. Although the direct method has certain advantages in its use of overall image information, some problems remain. The direct method assumes that the intensity of the same point in the image is constant at different moments and from different observation angles, and this assumption is easily violated by factors such as lighting changes and exposure time. In addition, it can only handle situations where the camera movement is small.

Monocular V-SLAM has scale ambiguity, and it generally needs to estimate the depth of the feature points through triangulation. An RGB-D camera, represented by the Kinect series, obtains the color image and the depth image of the scene simultaneously, which avoids the time-consuming depth estimation and largely eliminates scale drift, so many V-SLAM systems use RGB-D images as input. Among the earliest to propose using RGB-D cameras for real-time indoor 3D reconstruction were Peter Henry et al. of the University of Washington [10], who used surfels to represent 3D models [11]. The pose is estimated from sparse features and dense ICP (Iterative Closest Point), and the TORO tools are used for back-end optimization [12, 13, 14, 15]. Lourakis et al. released a software package called Sparse Bundle Adjustment [16] in 2010, which achieved faster back-end optimization. Endres et al. [17] used RGB-D cameras to construct a 3D map. Their article also compared the performance of different features such as SIFT [18], SURF [19] and ORB for matching between frames, and analyzed the influence of the number of feature points on the estimation accuracy; the g2o toolkit is used for back-end optimization [20].

In addition to the above-mentioned RGB-D SLAM using feature points, two RGB-D frames can be matched directly, and the pose between two adjacent frames can be estimated by minimizing the intensity error and the depth error [21, 22]. KinectFusion [23] was proposed by Microsoft Research in 2011. The algorithm constructs a volumetric model of the local space and uses the ICP algorithm to match the depth image against the built model to estimate the pose, reconstructing the indoor scene at the same time. However, the algorithm is limited to room-sized spaces, uses only the depth information in the calculation, and does not provide loop closure detection for global optimization. Based on KinectFusion, Whelan et al. expanded the size of the reconstruction and added color information and loop closure detection [24, 25, 26, 27].

2.2 Augmented reality and 3D registration

Augmented reality is a technology that integrates virtual objects into the real environment and can achieve seamless integration, rendering and interaction. With the technology, users can better perceive and interact with things in the real environment. This term first appeared in the work of Caudell et al. from Boeing Company [28]. Azuma defines the characteristics of augmented reality as the combination of virtual and real, real-time interaction and three-dimensional registration [29]. At present, augmented reality has been widely used in medical, entertainment, industry, education and other domains. 3D registration is one of the core technologies of the augmented reality system. With the maturity of augmented reality hardware and the expansion of application fields, the vision-based 3D registration has received more attention and development due to its strong versatility, high precision, and low cost [30]. Vision-based 3D registration can be divided into artificial identification registration, natural feature registration and model-based registration according to different implementation methods [31].

The method based on artificial identification usually places a specific marker in the real environment and uses the camera to recognize its position. The detected marker is then compared with the matching template to estimate the pose, and the position of the virtual object is calibrated through coordinate conversion so that the object can finally be fused into the real scene [32]. The most famous libraries are ARToolkit [33] and ARTag [34]. However, registration based on artificial identification can only be applied to scenarios where the markers have been arranged in advance, which seriously limits the scope of application of augmented reality, and it is therefore not suitable for large-scale outdoor environments.

Registration based on natural features tracks the natural features in the image, uses the detected targets to calculate the camera pose, and then calibrates the virtual objects through the conversion between the natural feature coordinates and the camera pose [35]. Commonly used feature operators include SIFT [18], SURF [19], FAST [36], ORB [37] and BRISK [38]. This kind of registration does not need markers to be placed in the real environment, but some algorithms suffer from problems such as low running speed, low matching accuracy and scale ambiguity during the registration process.

Model-based 3D registration generally tracks an object through its distinguishable features and predicts the camera pose from the external features of the model. Reality is then augmented by converting the 3D coordinates of the model to 2D image coordinates [39]. Vial et al. proposed a real-time model-based line tracking method, which estimates the camera pose by minimizing the distance between the sampling points on the projected lines of a given 3D model and the corresponding maxima of the image gradient [40]. Combining edge information, Reitmayr et al. proposed a hybrid tracking system based on a textured 3D model, which dynamically determines the camera pose and completes registration at runtime [41]. This kind of method is limited in its tracking region and has a high computational cost.

Because of the complexity of the real environment, traditional visual SLAM does not perform well. For example, a novice experiencing an AR application on a mobile device probably has no prior knowledge of how to operate the device carefully to obtain a good AR experience. In this process, it is easy to move quickly and rotate violently, and these actions pose a huge challenge even to the most advanced visual SLAM systems.

3. Framework overview

This paper uses dense matching of RGB-D images in the visual odometry, and uses color and depth error metrics for optimization in the back-end. At the same time, when the camera returns to a previously visited place, the depth map of the key frame is constructed and the model is updated accordingly, so as to build a globally consistent 3D model. The algorithm flow is shown in Fig. 2.

Figure 2.

Algorithm flow.

Both the pose estimation and the map construction of this system are accelerated by the GPU. Since the map is built for use in augmented reality, the localization algorithm must be efficient enough to leave time for the construction of dense models. In other words, localization mainly serves mapping, and high-quality maps in turn improve localization accuracy. To make full use of the depth and color information, the direct method is used for pose estimation. In the pose estimation and optimization stage we follow DVO [44, 45] for the algorithm implementation, as detailed in Section 4. The map is also constructed by a dense method, for which we adopt a TSDF-based model. The TSDF-based model is an implicit surface representation. For each point $x\in\mathbb{R}^{3}$ , SDF(x) is the signed distance from x to the nearest surface: it is positive when x is outside the surface and negative otherwise, and x is on the surface if and only if SDF(x) equals 0. Therefore, the surface of the model is implicitly defined by the iso-surface on which the SDF is 0. In the algorithm, the space is evenly divided into units called voxels, each of which stores its TSDF, weight and color. We arrange the voxels based on Voxel Hashing; see Section 5 for details. In addition, to eliminate the accumulated drift error, the algorithm detects loop closure with Bag-of-Words using a method similar to [42]. Finally, we verify the 3D registration of simple geometric objects based on this algorithm. Experiments show that the accuracy of registration is improved by approximately 4.60%.
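The per-voxel storage described above (TSDF and weight) can be illustrated with a minimal Python sketch. The weighted running average used in `integrate`, the per-observation weight of 1 and the `max_weight` cap are assumptions borrowed from common TSDF fusion practice; they are not stated in this paper, and color is omitted for brevity.

```python
def truncate_sdf(sdf, mu):
    """Clamp a signed distance (metres) to [-mu, mu] and scale it to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / mu))


def integrate(voxel, sdf, mu, max_weight=100.0):
    """Fuse one new observation into a voxel stored as (tsdf, weight).

    Assumption: a weighted running average with weight 1 per observation.
    """
    tsdf_old, w_old = voxel
    tsdf_new = truncate_sdf(sdf, mu)
    w_new = 1.0
    tsdf = (tsdf_old * w_old + tsdf_new * w_new) / (w_old + w_new)
    return tsdf, min(w_old + w_new, max_weight)


# A voxel seen 2 cm in front of the surface, then 2 cm behind it,
# with a truncation distance of 4 cm (hypothetical values).
voxel = (0.0, 0.0)
voxel = integrate(voxel, 0.02, 0.04)
voxel = integrate(voxel, -0.02, 0.04)
```

After the two observations the stored value returns to zero, that is, the voxel is estimated to lie on the surface.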

4. Pose estimation

At time $t$ , the RGB-D camera provides a color image $C_{t}$ and a corresponding depth image $D_{t}$ . Through continuous image alignment, the pose $T$ of the camera at each moment can be estimated, and then its trajectory can be obtained. The prerequisite for the application of the dense matching is the hypothesis of intensity consistency, that is, the intensity of the corresponding pixel in the two adjacent images remains unchanged.

$\displaystyle C_{t1}(x)=C_{t2}(x)$ (1)

However, due to the error of the camera itself and illumination difference, Eq. (1) cannot be strictly established. Therefore, the optimal estimation of the pose can be achieved by minimizing the intensity error. In addition to color, the depth information collected at a certain point in the space can also be compared with its reprojection. If the pose is accurate enough, the depth error should be close to zero.

Assume that a point $p=({X,Y,Z,1})^{\text{T}}$ in space corresponds to the pixel $x=({u,v})^{\text{T}}$ , with depth ${Z}={Z}(x)$ . We use Eq. (2) to obtain $x$ from $p$ .

$\displaystyle x=\pi(p)=\left({\frac{Xf_{x}}{Z}+o_{x},\frac{Yf_{y}}{Z}+o_{y}}\right)^{\text{T}}$ (2)

To recover $p$ from $x$ , Eq. (3) can be used.

$\displaystyle p=\pi^{-1}({x,Z})=\left({\frac{u-o_{x}}{f_{x}}Z,\frac{v-o_{y}}{f_{y}}Z,Z,1}\right)^{\text{T}}$ (3)

where $f_{x}$ and $f_{y}$ represent focal length, and $o_{x}$ and $o_{y}$ represent optical center coordinates.
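As a minimal sketch of Eqs. (2) and (3), the projection and back-projection can be written in Python as follows. The function names and the intrinsic values in the usage lines are illustrative assumptions, not calibration data from this paper.

```python
def project(p, fx, fy, ox, oy):
    """Eq. (2): map a homogeneous space point p = (X, Y, Z, 1) to a pixel (u, v)."""
    X, Y, Z = p[0], p[1], p[2]
    return (X * fx / Z + ox, Y * fy / Z + oy)


def back_project(x, Z, fx, fy, ox, oy):
    """Eq. (3): recover the homogeneous space point from pixel x = (u, v) and depth Z."""
    u, v = x
    return ((u - ox) / fx * Z, (v - oy) / fy * Z, Z, 1.0)


# Hypothetical intrinsics, for illustration only.
fx, fy, ox, oy = 525.0, 525.0, 319.5, 239.5
p = back_project((400.0, 300.0), 2.0, fx, fy, ox, oy)
u, v = project(p, fx, fy, ox, oy)  # round trip back to the original pixel
```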

The movement of the camera can be seen as a continuous sequence of translations and rotations. The motion of the camera can therefore be represented by a transformation matrix $T$ that belongs to the special Euclidean group SE(3).

$\displaystyle T_{4\times 4}=\left[\begin{array}{cc}R_{3\times 3}&t_{3\times 1}\\ 0^{\text{T}}&1\end{array}\right]$ (4)

where $R_{3\times 3}$ is the rotation matrix, an element of the special orthogonal group SO(3), and $t_{3\times 1}$ is the translation vector. Because SO(3) is strongly constrained and SE(3) is not closed under addition, the pose is represented by the Lie algebra se(3) corresponding to SE(3), whose elements can be written as vectors in $\mathbb{R}^{6}$ .

$\displaystyle\text{se}(3)=\left\{\xi=\left[\begin{array}{c}\rho\\ \phi\end{array}\right]\in\mathbb{R}^{6},\ \rho\in\mathbb{R}^{3},\ \phi\in\text{so}(3),\ \xi^{\wedge}=\left[\begin{array}{cc}\phi^{\wedge}&\rho\\ 0^{\text{T}}&0\end{array}\right]\in\mathbb{R}^{4\times 4}\right\}$ (5)

where the operator $\wedge$ converts a 6D vector into a $4\times 4$ matrix (and, applied to $\phi$ , a 3D vector into a $3\times 3$ skew-symmetric matrix). The relationship between SE(3) and se(3) is given by Eq. (6).

$\displaystyle T=\exp({\xi^{\wedge}})$ (6)
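The exponential map of Eq. (6) has a well-known closed form (the Rodrigues formula for the rotation, plus a left Jacobian applied to $\rho$ ). The following Python sketch uses that closed form with the ordering $\xi=(\rho,\phi)$ of Eq. (5). The helper names `hat3`, `matmul` and `se3_exp` are illustrative, and the small-angle threshold is an arbitrary choice.

```python
import math


def hat3(phi):
    """Skew-symmetric 3x3 matrix of a 3D vector."""
    x, y, z = phi
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def se3_exp(xi):
    """Eq. (6): map xi = (rho, phi) in se(3) to a 4x4 matrix T in SE(3)."""
    rho, phi = xi[:3], xi[3:]
    theta = math.sqrt(sum(c * c for c in phi))
    K = hat3(phi)
    K2 = matmul(K, K)
    if theta < 1e-8:
        # Leading Taylor terms for very small rotations.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rotation R = I + a*K + b*K^2 and left Jacobian V = I + b*K + c*K^2.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][k] * rho[k] for k in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


# A pure rotation of 90 degrees about the z axis.
T = se3_exp([0.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2])
```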

A pixel in the previous image is back-projected into space, and its coordinates in the new image, obtained by re-projecting it under the new camera pose $T$ , can be expressed as follows.

$\displaystyle x^{\prime}=\tau(x,T)=\pi(T\pi^{-1}(x,Z(x)))$ (7)

$\tau(\cdot)$ is the projection function after transformation.

In dense matching, Eq. (7) is used to define the intensity error and depth error of a single pixel as follows.

$\displaystyle{\varepsilon}_{C}=C_{t2}({\tau({x,T})})-C_{t1}(x)$ (8)

$\displaystyle{\varepsilon}_{D}=Z_{t2}({\tau({x,T})})-[{T\pi^{-1}({x,Z_{t1}(x)})}]_{Z}$ (9)

$[\cdot]_{Z}$ represents the $z$ coordinate of the point.
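The following Python sketch evaluates the two residuals of Eqs. (8) and (9) for a single pixel. It is an illustration only: images are nested lists indexed as `image[v][u]`, the warped pixel is sampled with nearest-neighbour lookup instead of interpolation, and a depth of 0 is treated as invalid. These choices and the function name `warp_residuals` are assumptions.

```python
def warp_residuals(x, C1, Z1, C2, Z2, T, intrinsics):
    """Return (eps_C, eps_D) for pixel x = (u, v), or None if the warp is invalid."""
    fx, fy, ox, oy = intrinsics
    u, v = x
    Z = Z1[v][u]
    if Z <= 0.0:
        return None  # no depth measurement at this pixel
    # Back-project with Eq. (3), then transform with the 4x4 pose T.
    p = ((u - ox) / fx * Z, (v - oy) / fy * Z, Z, 1.0)
    q = [sum(T[i][k] * p[k] for k in range(4)) for i in range(3)]
    if q[2] <= 0.0:
        return None  # the point falls behind the second camera
    # Re-project with Eq. (2) and round to the nearest pixel.
    u2 = int(round(q[0] * fx / q[2] + ox))
    v2 = int(round(q[1] * fy / q[2] + oy))
    if not (0 <= v2 < len(C2) and 0 <= u2 < len(C2[0])):
        return None  # the warped pixel leaves the image
    eps_C = C2[v2][u2] - C1[v][u]   # Eq. (8)
    eps_D = Z2[v2][u2] - q[2]       # Eq. (9)
    return eps_C, eps_D


# Two identical 4x4 frames and the identity pose give zero residuals.
C = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
Z = [[1.0] * 4 for _ in range(4)]
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
res = warp_residuals((2, 1), C, Z, C, Z, I4, (2.0, 2.0, 1.5, 1.5))
```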

We use the weighted least squares method, as in [45], to estimate the pose iteratively. The objective function to be optimized is as follows.

$\displaystyle\xi^{\ast}=\mathop{\text{argmin}}\limits_{\xi}\sum\limits_{i}^{n}\omega_{i}\varepsilon_{i}^{\text{T}}{\Sigma}^{-1}\varepsilon_{i}$ (10)

where $\omega_{i}$ is the weight, $\varepsilon_{i}=({{\varepsilon}_{C},{\varepsilon}_{D}})^{\text{T}}$ is the residual vector of the $i$ -th pixel, and ${\Sigma}$ is the $2\times 2$ covariance matrix of the t-distribution. Linearizing the residual with a first-order Taylor expansion yields the following linear equations for the pose increment $\Delta\xi$ .

$\displaystyle\sum\limits_{i}^{n}\omega_{i}J_{i}^{\text{T}}{\Sigma}^{-1}J_{i}\Delta\xi=-\sum\limits_{i}^{n}\omega_{i}J_{i}^{\text{T}}{\Sigma}^{-1}\varepsilon_{i}$ (11)

$J_{i}$ is the Jacobian matrix, and $n$ is the number of pixels.
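The step from Eq. (10) to Eq. (11) can be spelled out as below. This is a sketch of the standard Gauss-Newton argument, assuming that $\Sigma$ is symmetric and that the weights $\omega_{i}$ are held fixed within one iteration. The left-multiplicative update in the last line is an assumed convention that the text does not state.

```latex
\begin{align*}
\varepsilon_i(\Delta\xi) &\approx \varepsilon_i + J_i\,\Delta\xi,
  \qquad J_i = \left.\frac{\partial \varepsilon_i}{\partial \Delta\xi}\right|_{\Delta\xi = 0}
  \in \mathbb{R}^{2\times 6} \\
E(\Delta\xi) &= \sum_{i}^{n} \omega_i\,
  (\varepsilon_i + J_i\,\Delta\xi)^{\text{T}}\,\Sigma^{-1}\,(\varepsilon_i + J_i\,\Delta\xi) \\
\frac{\partial E}{\partial \Delta\xi} = 0 \;&\Rightarrow\;
  \sum_{i}^{n} \omega_i J_i^{\text{T}}\Sigma^{-1}J_i\,\Delta\xi
  = -\sum_{i}^{n} \omega_i J_i^{\text{T}}\Sigma^{-1}\varepsilon_i \\
T &\leftarrow \exp(\Delta\xi^{\wedge})\,T
\end{align*}
```

The third line is exactly Eq. (11); it is a $6\times 6$ linear system that is solved once per iteration until $\Delta\xi$ becomes sufficiently small.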

The above is the complete process of pose estimation by minimizing the intensity error and depth error of the RGB-D image.

Algorithm 1: Pose Estimation with GPU
1: for each pixel $x=(u,v)^{\text{T}}$ in color image $C_{t}$
2: Get the depth: $Z\leftarrow Z(x)$
3: Back-project to the space point: $p\leftarrow\pi^{-1}({x,Z})=\left({\frac{u-o_{x}}{f_{x}}Z,\frac{v-o_{y}}{f_{y}}Z,Z,1}\right)^{\text{T}}$
4: Re-project: $x^{\prime}\leftarrow\pi(Tp)$
5: $\varepsilon_{C}\leftarrow$ photometric error of the pixel, Eq. (8)
6: ${\varepsilon}_{D}\leftarrow$ depth error of the pixel, Eq. (9)
7: Reduce the error terms over all pixels and solve Eq. (11) for the best $T$

5. Map construction

A dense 3D model can effectively solve the occlusion problem between virtual objects and the real scene that monocular cameras face, allows more natural registration of virtual objects, and improves the user experience. This paper therefore uses a dense map to represent the 3D structure of the environment. To improve the speed of surface reconstruction and reduce the memory footprint, we modified the voxel hashing scheme of [46], changing the parameters of the hash function, the size of the bucket, and the conflict handling method [47]. We use Eq. (12) to calculate the address of a voxel block.

$\displaystyle H({x,y,z})=({x\cdot p_{1}\oplus y\cdot p_{2}\oplus z\cdot p_{3}})\bmod n$ (12)

where $x$ , $y$ , $z$ are the world coordinates of the current voxel block, $p_{1}$ , $p_{2}$ and $p_{3}$ are large prime numbers (12354371, 80044597 and 67347179, respectively), $\oplus$ denotes bitwise XOR, and $n$ is the size of the hash table.
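Eq. (12) translates directly into a few lines of Python. The three constants are those given above. The helper `block_coords`, which turns a metric position into integer block coordinates, is an assumption added for illustration; its default block size corresponds to 8 voxels of 1/256 m each, the values used in the experiments.

```python
import math

P1, P2, P3 = 12354371, 80044597, 67347179  # constants from the text


def block_hash(x, y, z, n):
    """Eq. (12): hash integer block coordinates into a table of size n."""
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % n


def block_coords(point, block_size=8.0 / 256.0):
    """Assumed helper: integer block coordinates of a point given in metres."""
    return tuple(math.floor(c / block_size) for c in point)


# Address of the block containing a hypothetical world point.
address = block_hash(*block_coords((0.1, -0.01, 0.0)), 10 ** 6)
```

Python integers are unbounded and `%` returns a non-negative result for a positive modulus, so negative block coordinates need no special handling here. A fixed-width GPU implementation would have to treat overflow and sign explicitly.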

In the experiment, the size of our reconstructed space is about 3 m $\times$ 3 m $\times$ 3 m, and each block contains 8 $\times$ 8 $\times$ 8 voxels. Whereas [46] aims at real-time and accurate reconstruction of the scene, our goal is to serve the 3D registration of virtual objects in augmented reality. Therefore, the scene can be smaller, which directly reduces the amount of data, and the number of entries in the hash table is reduced accordingly. In [46], only 0.1% of the buckets overflowed. Therefore, the number of hash entries per bucket is reduced to 3 in our algorithm. In [46], a linked list is used to handle hash conflicts. We use quadratic probing to obtain the storage address of the next item in the linked list after a conflict.
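A bucketed hash table with three entries per bucket and quadratic probing can be sketched in Python as follows. This is a simplified CPU illustration, not the GPU data structure of the paper: the class name `BlockTable`, the probe sequence $h+i^{2}$ and the absence of deletion are all assumptions.

```python
P1, P2, P3 = 12354371, 80044597, 67347179  # constants of Eq. (12)


def block_hash(x, y, z, n):
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % n


class BlockTable:
    """Hash table of voxel blocks: fixed-size buckets plus quadratic probing."""

    def __init__(self, n_buckets, bucket_size=3):
        self.n = n_buckets
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(n_buckets)]

    def _probe(self, key):
        # Quadratic probe sequence h, h+1, h+4, h+9, ... (mod n).
        # It does not necessarily visit every bucket.
        h = block_hash(*key, self.n)
        for i in range(self.n):
            yield (h + i * i) % self.n

    def insert(self, key, value):
        for b in self._probe(key):
            bucket = self.buckets[b]
            for j, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[j] = (key, value)  # overwrite an existing entry
                    return b
            if len(bucket) < self.bucket_size:
                bucket.append((key, value))
                return b
        raise MemoryError("no free slot along the probe sequence")

    def find(self, key):
        for b in self._probe(key):
            bucket = self.buckets[b]
            for k, v in bucket:
                if k == key:
                    return v
            if len(bucket) < self.bucket_size:
                return None  # a non-full bucket ends the search (no deletions)
        return None


table = BlockTable(n_buckets=7)
for i in range(10):
    table.insert((i, 0, 0), i)
```

Because entries are never deleted in this sketch, a key can only live beyond a bucket if that bucket is full, which is why `find` may stop at the first non-full bucket.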

Once the data structure is set up, the algorithm imposes two constraints to save storage. The first is to allocate voxel blocks only for the part of the scene inside the view frustum, which eliminates most of the currently invisible scene. The second is to allocate voxel blocks only within the truncation distance. Specifically, a ray is cast from the optical center through every pixel in parallel, and the blocks to be stored are determined from the TSDF value.
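The second constraint can be illustrated with a short Python sketch that collects the voxel blocks touched by one ray inside the truncation band around the measured surface. The uniform sampling along the ray (rather than an exact grid traversal), the camera at the world origin and the name `blocks_along_ray` are simplifying assumptions.

```python
import math


def blocks_along_ray(direction, distance, mu, block_size, step):
    """Blocks hit by a ray within [distance - mu, distance + mu].

    direction: unit ray direction in world coordinates (camera at the origin).
    distance:  measured distance to the surface along the ray, in metres.
    mu:        truncation distance; step: sampling interval along the ray.
    """
    blocks = set()
    n_steps = int(round(2.0 * mu / step))
    for i in range(n_steps + 1):
        t = distance - mu + i * step
        if t <= 0.0:
            continue  # skip samples behind the camera
        blocks.add(tuple(math.floor(c * t / block_size) for c in direction))
    return blocks


# A ray along the optical axis that hits a surface 1 m away (hypothetical values).
needed = blocks_along_ray((0.0, 0.0, 1.0), 1.0, 0.05, 8.0 / 256.0, 0.01)
```

Only the blocks returned for some pixel would be inserted into the hash table; everything outside the band stays unallocated.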

After voxel block allocation, the GPU accesses all entries in the hash table in parallel, identifies those that are occupied and visible, marks the corresponding elements in an identification array (initialized to 0) as 1 according to their index, and then scans the array with a parallel prefix sum [48]. Each entry that meets the condition increments the running sum carried over from the previous element, which forms an index array. The positions where the value changes in the index array correspond to the voxel blocks to be selected. The GPU then updates the TSDF value, weight and color of each voxel in the selected voxel blocks.
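The selection step described above is a stream compaction. The sequential Python sketch below mirrors it with an exclusive prefix sum, a common variant of the scan in which each selected entry reads its output position directly from the scanned array. On the GPU both the flagging and the scan run in parallel; the function name and the `is_visible` predicate are illustrative assumptions.

```python
from itertools import accumulate


def compact_visible(entries, is_visible):
    """Gather the occupied and visible hash entries into a dense list."""
    # Identification array: 1 for entries that are occupied and visible.
    flags = [1 if e is not None and is_visible(e) else 0 for e in entries]
    # Exclusive prefix sum: scan[i] is the number of selected entries before i.
    scan = [0] + list(accumulate(flags))[:-1]
    out = [None] * sum(flags)
    for i, flag in enumerate(flags):
        if flag:
            out[scan[i]] = entries[i]  # scatter to the compacted position
    return out


entries = [None, "a", "b", None, "c"]      # None marks an empty hash entry
visible = compact_visible(entries, lambda e: e != "b")
```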

Constructing the map with the above method is equivalent to constructing the surfaces of the objects in space. When the model surface is constructed, the TSDF zero level set is extracted by ray casting [49], the world coordinates are obtained by trilinear interpolation, and finally the scene is rendered.
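The two numerical ingredients of this step can be sketched in Python: trilinear interpolation of the TSDF between the eight neighbouring voxels, and linear interpolation of the zero crossing between two consecutive ray samples. Representing the voxel grid as a callable `tsdf(i, j, k)` and the function names are assumptions made to keep the example self-contained.

```python
import math


def trilinear(tsdf, x, y, z):
    """Interpolate a voxel field at a continuous position given in voxel units."""
    i, j, k = math.floor(x), math.floor(y), math.floor(z)
    dx, dy, dz = x - i, y - j, z - k
    value = 0.0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                # Weight of the corner (i+a, j+b, k+c).
                w = ((dx if a else 1.0 - dx)
                     * (dy if b else 1.0 - dy)
                     * (dz if c else 1.0 - dz))
                value += w * tsdf(i + a, j + b, k + c)
    return value


def zero_crossing(t0, f0, t1, f1):
    """Ray parameter at which the TSDF changes sign between two samples."""
    return t0 + (t1 - t0) * f0 / (f0 - f1)


# The TSDF is +0.5 at t = 1.0 and -0.5 at t = 2.0 along a ray (hypothetical values).
t_surface = zero_crossing(1.0, 0.5, 2.0, -0.5)
```

The world coordinates of the surface point follow by evaluating the ray at `t_surface`.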

Figure 3.

Voxel hashing.

6. Loop closure detection

In SLAM, loop closure detection generally refers to judging, based on sensor feedback, whether the robot has returned to a previously visited place. If the sensor is mainly a camera, the image is used to determine whether the current frame has a corresponding description along the previous path. Loop closure can be used to adjust the estimated camera pose, which effectively suppresses error accumulation. If the global pose can be corrected, a consistent map can be obtained. In this paper, Bag-of-Words is used to represent the visual features for image classification and similarity detection.

To improve the accuracy of closed-loop detection, the whole detection process is divided into three steps. The first step is keyframe retrieval. For the current frame, the Bag-of-Words vector is constructed and matched against all key frames in the existing database. The key frame with the most similar Bag-of-Words vector is selected, and the distance $\eta$ between the two Bag-of-Words vectors is calculated. If $\eta$ is less than a specific threshold, a candidate closed loop is found. The second step is a consistency test. After the candidate closed-loop pairs are obtained, their feature points are matched together with the corresponding depth maps. If the number of matched feature points meets the requirements and the proportion of inliers exceeds a specific threshold, the two frames are considered to have passed the random consistency test, and the relative pose transformation H of the two frames under the closed-loop constraint is obtained. The third step is a dense registration test, which filters out suboptimal closed-loop candidates and finally refines the transformation matrix H.
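The keyframe retrieval step can be sketched in Python as follows. Bag-of-Words vectors are represented as dictionaries from visual word id to weight, and $\eta$ is taken to be the L1 distance between L1-normalised vectors. The text does not specify the distance, so this metric, the function names and the threshold value are assumptions.

```python
def bow_distance(a, b):
    """L1 distance between two L1-normalised Bag-of-Words vectors (range 0 to 2)."""
    def normalise(vec):
        total = sum(abs(w) for w in vec.values())
        return {k: w / total for k, w in vec.items()} if total else {}

    a, b = normalise(a), normalise(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def find_loop_candidate(query, keyframes, threshold):
    """Return (keyframe id, eta) of the most similar key frame, or None."""
    best_id, best_eta = None, float("inf")
    for kf_id, vec in keyframes.items():
        eta = bow_distance(query, vec)
        if eta < best_eta:
            best_id, best_eta = kf_id, eta
    return (best_id, best_eta) if best_eta < threshold else None


# Two key frames in the database and a query sharing its words with key frame 0.
keyframes = {0: {1: 2.0, 2: 2.0}, 1: {3: 1.0}}
candidate = find_loop_candidate({1: 1.0, 2: 1.0}, keyframes, threshold=0.5)
```

A returned candidate would then go through the consistency test and the dense registration test described above.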

7. 3D registration

The quality of the fusion of virtual and real scenes depends on the registration of the virtual object. At present, most augmented reality systems based on monocular cameras simply superimpose virtual objects on the real scene because they lack depth information [50], and cannot show real objects occluding virtual ones. Since we use an RGB-D camera, the occlusion between virtual and real objects can be better controlled through the dense depth map. We have implemented a simple augmented reality system. After the reconstruction of the 3D scene is completed, virtual and real objects are rendered in layers, which shows the occlusion effect naturally [42]; see the experimental part for details.

According to the estimated camera pose and the dense reconstruction model, we use OpenGL to adjust the pose of the controller and cup 3D models, and superimpose the virtual objects on the corresponding positions of the windowsill and desktop through rotation and translation. The projection and view transformations are then applied to create a 2D image of the 3D scene.
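The occlusion handling enabled by the dense depth map amounts to a per-pixel depth comparison between the real scene and the rendered virtual object. The Python sketch below shows that comparison on nested-list images. In the actual system this would be done by the OpenGL depth test during rendering; the function name and the conventions for missing depth (`None` for no virtual fragment, 0 for no real measurement) are assumptions.

```python
def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    """Per-pixel fusion: the virtual pixel wins only where it is closer to the camera."""
    out = []
    for r_row, rd_row, v_row, vd_row in zip(real_rgb, real_depth, virt_rgb, virt_depth):
        row = []
        for r, rd, v, vd in zip(r_row, rd_row, v_row, vd_row):
            # vd is None where no virtual object is rendered;
            # rd <= 0 marks a pixel without a real depth measurement.
            virtual_visible = vd is not None and (rd <= 0.0 or vd < rd)
            row.append(v if virtual_visible else r)
        out.append(row)
    return out


# One image row: the virtual object is behind, in front of, and absent (hypothetical values).
fused = composite([["R", "R", "R"]], [[1.0, 2.0, 0.0]],
                  [["V", "V", "V"]], [[1.5, 1.5, None]])
```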

8. Experiments

To verify the feasibility and effectiveness of the algorithm designed in this paper, we conducted two sets of experiments in an indoor environment. The experimental equipment is as follows: Kinect V2 RGB-D camera, Intel i9 9900K processor, 32 GB DDR4 memory, NVIDIA RTX 2080 Ti graphics card, and the OpenGL library. We use the Microsoft Visual Studio 2013 IDE as the development and testing tool, and the language is C++.

8.1 Dense reconstruction

To analyze the performance and model quality of the reconstruction, we compared our system with KinectFusion, which also uses TSDF for voxel representation. The depth range was set to 0.5 m to 8 m in both algorithms, and the number of voxels per meter is 256 (that is, the voxel size is about 4 mm). In our algorithm, the number of blocks in the hash table is $10^{6}$ , and each voxel block consists of 8 $\times$ 8 $\times$ 8 voxels. The three prime numbers in the hash function are 12354371, 80044597, and 67347179. Figure 4 shows the real-time reconstruction results of the two algorithms (width: 2 m, height: 2 m, depth: 1.5 m). The left image is produced by KinectFusion, and the right by our method. At the same resolution, our algorithm reconstructs the flower pots and windows more finely. In Fig. 4a, the KinectFusion result, there are some big holes in the top-right corner and at the bottom of the image. In Fig. 4b on the right, although there are some small holes at the edges of the radiator and window, the reconstruction is greatly improved.

Figure 4.

Live reconstruction scenes.

Figure 5 shows the scenes rendered by the two algorithms (width: 4 m, height: 2.5 m, depth: 3 m) after a scan of 45 seconds. The left image is again produced by KinectFusion, and the right by our method. Both methods largely preserve the integrity of the model in the scene, and the surfaces reconstructed by both are relatively smooth. The two differ in the details. Our algorithm uses the hash table data structure and GPU streaming, which make full use of the memory, so in Fig. 5b the gaps in the radiator are clearly visible, the edges and corners of objects are more distinct, and the reconstructed details are more complete and clear, whereas Fig. 5a contains many overlapping areas filled with redundant surface.

Figure 5.

Reconstruction scenes after scanning.

8.2 3D registration

We designed a simple augmented reality system that allows users to fuse virtual objects into the scene. Since the system is based on the camera pose and the dense 3D model, it can calculate relatively complete and accurate depth information for each frame. During rendering, the distance relationship between the virtual object and the real scene can be obtained, so that occlusion can be handled effectively, producing a better visual fusion of virtual and real. Figure 6 shows screenshots of fusing the two virtual objects, the kettle and the controller, into the real scene. It can be seen that the controller (with the glass and window as the fulcrum) in the left image and the cup in the right image achieve a realistic occlusion effect, and the accuracy of the 3D registration meets the requirements of scene integration and consistency. We define the registration accuracy as the ratio of the area where virtual and real are fused to the corresponding area of the image captured with real objects. Compared with the fusion results obtained with PTAM, the registration accuracy of our algorithm is improved by about 4.60%.

Figure 6.

Fusion scenes.
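The occlusion handling and the accuracy measure described in this section can be sketched as a per-pixel depth test followed by an area ratio. This is a minimal illustration under stated assumptions, not the system's rendering code: images are plain nested lists, a depth of `None` means "no measurement" (treated as infinitely far for the real scene), and the function names `composite_with_occlusion` and `area_ratio` are hypothetical. How the reference area is obtained from the image with real objects is not specified in the text, so the reference mask is simply taken as an input.

```python
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]
Depth = Optional[float]  # metres; None means no depth at this pixel


def composite_with_occlusion(
    real_rgb: List[List[Color]],
    real_depth: List[List[Depth]],
    virt_rgb: List[List[Color]],
    virt_depth: List[List[Depth]],
) -> Tuple[List[List[Color]], List[List[bool]]]:
    """Draw the virtual object only where it is closer to the camera than the real scene."""
    height, width = len(real_rgb), len(real_rgb[0])
    fused = [row[:] for row in real_rgb]
    visible = [[False] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d_virt = virt_depth[v][u]
            if d_virt is None:
                continue  # the virtual object does not cover this pixel
            d_real = real_depth[v][u]
            # Assumption: a missing real depth is treated as infinitely far away.
            if d_real is None or d_virt < d_real:
                fused[v][u] = virt_rgb[v][u]
                visible[v][u] = True
    return fused, visible


def area_ratio(fused_mask: List[List[bool]], reference_mask: List[List[bool]]) -> float:
    """Ratio of the fused (visible virtual) area to a reference area, both in pixels."""
    fused_area = sum(cell for row in fused_mask for cell in row)
    reference_area = sum(cell for row in reference_mask for cell in row)
    if reference_area == 0:
        raise ValueError("reference mask is empty")
    return fused_area / reference_area


REAL, VIRT = (0, 0, 0), (255, 0, 0)
real_rgb = [[REAL, REAL, REAL]]
real_depth = [[1.0, 2.0, None]]
virt_rgb = [[VIRT, VIRT, VIRT]]
virt_depth = [[1.5, 1.5, None]]  # object covers the first two pixels at 1.5 m

fused, visible = composite_with_occlusion(real_rgb, real_depth, virt_rgb, virt_depth)
print(fused)    # first pixel stays real (occluded), second shows the virtual object
print(visible)
```

In this toy row, the virtual object is hidden behind the real surface at 1.0 m and drawn in front of the one at 2.0 m, which is the behavior that produces the occlusion effect in Figure 6.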

9. Conclusion

We studied dense matching for camera pose estimation, dense scene reconstruction based on voxel hashing, global optimization based on Bag-of-Words and graph optimization, and a simple algorithm for fusing virtual and real scenes with an RGB-D camera. Quantitative and qualitative experiments show that the system achieves pose estimation and real-time dense 3D reconstruction in indoor environments. Compared with existing algorithms, the reconstruction produced by our algorithm is more refined and smooth, and the accuracy of the 3D registration of virtual objects increases by about 4.60%. For outdoor scenes, however, the system is less robust because of the limitations of the sensor itself. In addition, when the camera moves fast, the system produces ghost images and may even lose tracking; addressing this requires further research.

Acknowledgments

This research was jointly supported by the Jilin Provincial Science and Technology Development Program (20190201265JC) and CCIT Science and Technology Project (320200011).
